I am a computational biologist with six years of wet lab experience and over ten years of computation experience. I will help you to learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter! https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources
Hello Bioinformatics lovers,

**The format for the interval figures was messed up. I updated them.**

Also, I expanded this newsletter into a full blog post with practical examples that solve real-world problems. See here: https://divingintogeneticsandgenomics.com/post/genomic-interval/

Also, watch how I install bedtools with conda here.

What's the most common problem you need to solve when dealing with genomics data? For me, it is Genomic Intervals!

Genomics data is usually represented linearly: chromosome name, start, and end. We use these intervals to define a region in the genome (a peak from ChIP-seq data), the location of a gene, a DNA methylation site (a single point), a mutation call (a single point), a duplicated region in cancer, and so on.

When I first started to learn programming 12 years ago in a wet lab, my task was to find which genes a set of ChIP-seq peaks bind to. To solve this, we have two files (dummy example below):
The task is "easy": find the overlaps between those two files (in fact, you can eyeball this example: peak1 binds to gene1 and peak3 binds to gene3)!

As a beginner, I did not know much, so I read both files in with Python, looped over the lines, and compared:

These are a lot of conditions to compare! Instead, we can find the conditions under which the two regions do not overlap: either region 2 starts after region 1 ends, or region 1 starts after region 2 ends.
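To make the brute-force idea concrete, here is a minimal Python sketch. It is not my original 2012 script. The `Interval` type, the function names, and the coordinates are all made up for illustration, chosen so that peak1 hits gene1 and peak3 hits gene3. It also assumes that regions on different chromosomes never overlap, and it treats both start and end as inclusive.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    """A genomic interval (hypothetical helper type for this sketch)."""
    chrom: str
    start: int
    end: int
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if a and b overlap, treating start and end as inclusive."""
    # Assumption: regions on different chromosomes never overlap.
    if a.chrom != b.chrom:
        return False
    # Two regions miss each other only if one starts after the other ends,
    # so they overlap when neither of those conditions holds.
    return not (b.start > a.end or a.start > b.end)


def brute_force_intersect(
    peaks: List[Interval], genes: List[Interval]
) -> List[Tuple[str, str]]:
    """Compare every peak against every gene: O(n * m) comparisons."""
    return [(p.name, g.name) for p in peaks for g in genes if overlaps(p, g)]


# Made-up coordinates, for illustration only.
peaks = [
    Interval("chr1", 100, 200, "peak1"),
    Interval("chr1", 1000, 1100, "peak2"),
    Interval("chr2", 500, 600, "peak3"),
]
genes = [
    Interval("chr1", 150, 400, "gene1"),
    Interval("chr1", 2000, 3000, "gene2"),
    Interval("chr2", 550, 900, "gene3"),
]

hits = brute_force_intersect(peaks, genes)
print(hits)
```

One caveat: BED files use 0-based, half-open coordinates (the end position is not included), so for BED input the strict `>` would become `>=`. The nested loop is fine for a toy example but gets slow on real data, which is where interval trees and dedicated tools help.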
The comparison will be: if NOT ((start2 > end1) or (start1 > end2)), then the two regions overlap!

My brute force method worked, and I felt accomplished as a beginner. As I became a little more experienced, I got to know the interval tree data structure, which makes those types of comparisons much faster and more efficient.

In 2010, bedtools was published, and with a single command (bedtools intersect) you can accomplish what I did with my Python script. Remember, I wrote my script in 2012, two years after bedtools was published. The problem was that I did not know this tool even existed! As a beginner, ignorance of what's out there is the price you pay. (Wink: follow me on X, https://x.com/tangming2005, where I tweet about tools and papers.)

I was not alone in this. On Thursday, I had the pleasure of having dinner with Dr. Ting Wang, whom we invited to give a talk at our company. He told me that in the early days, he wrote a Perl script to intersect genomic regions and found that TP53 binds to transposable elements (TEs). See his paper from 2007 (https://pubmed.ncbi.nlm.nih.gov/18003932/). Of course, Ting was a formally trained PhD in bioinformatics, and I am sure his Perl script was much better than my crappy one. But this tells you how common this type of analysis is, and bedtools came to the rescue in 2010.

Later, I started to learn more about the Bioconductor ecosystem and learned the GenomicRanges package, which is the foundation for dealing with genomic intervals.

Action item: I highly recommend you learn how to use bedtools and GenomicRanges.

That's the story for today. Stay tuned for my next one.

Happy Learning!

Tommy

PS: If you want to learn Bioinformatics, there are four ways that I can help:
Stay awesome!