Hi! I'm Tommy Tang

Resending with real-world examples: You need to master this if you deal with genomics data


Hello Bioinformatics lovers,

**The format for the interval figures was messed up. I updated them**

Also, I expanded this newsletter into a full blog post with practical examples that solve real-world problems. See here: https://divingintogeneticsandgenomics.com/post/genomic-interval/

Also, watch how I install bedtools with conda here.

What's the most common problem you need to solve when dealing with genomics data?

For me, it is Genomic Intervals!

Genomics data are usually represented linearly: chromosome name, start, and end.

We use these coordinates to define a region in the genome: a peak from ChIP-seq data, the location of a gene, a DNA methylation site (a single point), a mutation call (a single point), a duplicated region in cancer, etc.

When I first started to learn programming 12 years ago in a wet lab, my task was to find where a set of peaks (from ChIP-seq) bind to genes. To solve this, we have two files (dummy example below):

  1. a peak file with chr, start, end for each row
    chr1 200 300 peak1
    chr2 400 500 peak2
    chr3 456 888 peak3
    .....
  2. A gene file also has chr, start, end, and name for each row denoting the gene's transcription start sites (TSS) + 50 bp upstream and downstream:
    chr1 250 350 gene1
    chr2 600 700 gene2
    chr3 700 800 gene3
    ....

The task is "easy": find the overlaps between those two files (in fact, you can eyeball this example: peak1 binds to gene1, and peak3 binds to gene3)!

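As a beginner, I did not know much, so I read both files in with Python, looped over the lines, and compared them.
For two regions to overlap, chr should be the same, and one of the following 4 conditions should hold:

  1. region 1 starts first and ends inside region 2
  2. region 2 starts first and ends inside region 1
  3. region 1 sits entirely inside region 2
  4. region 2 sits entirely inside region 1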

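That is a lot of conditions to compare! Instead, we can look for the conditions under which the two regions do not overlap:

  1. region 2 starts after region 1 ends (start2 > end1)
  2. region 1 starts after region 2 ends (start1 > end2)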

The comparison will be: if NOT ((start2 > end1) or (start1 > end2)), then the two regions overlap! My brute-force method worked, and I felt accomplished as a beginner.
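A minimal sketch of such a brute-force script might look like the following. It is an illustration, not the original 2012 script: the function names (read_regions, overlaps, brute_force_overlaps) are made up, and the two dummy files are embedded as strings so the example runs on its own.

```python
import io

# Dummy data copied from the example above (whitespace-separated columns).
PEAKS = """\
chr1 200 300 peak1
chr2 400 500 peak2
chr3 456 888 peak3
"""
GENES = """\
chr1 250 350 gene1
chr2 600 700 gene2
chr3 700 800 gene3
"""


def read_regions(handle):
    """Parse 'chr start end name' lines into (chr, start, end, name) tuples."""
    regions = []
    for line in handle:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, name = fields[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions


def overlaps(start1, end1, start2, end2):
    """True unless one region starts after the other one ends."""
    return not (start2 > end1 or start1 > end2)


def brute_force_overlaps(peaks, genes):
    """Compare every peak against every gene: simple, but O(peaks x genes)."""
    hits = []
    for p_chr, p_start, p_end, p_name in peaks:
        for g_chr, g_start, g_end, g_name in genes:
            if p_chr == g_chr and overlaps(p_start, p_end, g_start, g_end):
                hits.append((p_name, g_name))
    return hits


peaks = read_regions(io.StringIO(PEAKS))
genes = read_regions(io.StringIO(GENES))
print(brute_force_overlaps(peaks, genes))
# [('peak1', 'gene1'), ('peak3', 'gene3')]
```

One caveat: this comparison treats both ends as part of the region. If your files follow the BED convention (0-based start, end not included), two regions that merely touch (start2 == end1) share no base, so you would use >= instead of >.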

As I became a little more experienced, I got to know the interval tree data structure, which makes those types of comparisons much faster and more efficient.
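To give a feel for the idea, here is a rough sketch of one flavor of interval tree: a static, balanced binary tree keyed on start, where every node also remembers the largest end in its subtree so whole branches can be skipped. The names (Node, build, query, find_overlaps) are hypothetical, and real libraries are more complete than this; treat it as a teaching sketch under those assumptions.

```python
from collections import defaultdict


class Node:
    def __init__(self, start, end, name):
        self.start, self.end, self.name = start, end, name
        self.max_end = end  # largest end found in this node's subtree
        self.left = None
        self.right = None


def build(sorted_regions):
    """Build a balanced tree from (start, end, name) tuples sorted by start."""
    if not sorted_regions:
        return None
    mid = len(sorted_regions) // 2
    node = Node(*sorted_regions[mid])
    node.left = build(sorted_regions[:mid])
    node.right = build(sorted_regions[mid + 1:])
    for child in (node.left, node.right):
        if child is not None:
            node.max_end = max(node.max_end, child.max_end)
    return node


def query(node, start, end, hits):
    """Append the names of all regions overlapping [start, end] to hits."""
    if node is None or node.max_end < start:
        return  # nothing in this subtree ends late enough to reach the query
    query(node.left, start, end, hits)
    if not (node.start > end or start > node.end):
        hits.append(node.name)
    if node.start <= end:  # regions to the right start even later
        query(node.right, start, end, hits)


genes = [("chr1", 250, 350, "gene1"),
         ("chr2", 600, 700, "gene2"),
         ("chr3", 700, 800, "gene3")]

# One tree per chromosome, since regions on different chromosomes never overlap.
by_chrom = defaultdict(list)
for chrom, start, end, name in genes:
    by_chrom[chrom].append((start, end, name))
trees = {chrom: build(sorted(regions)) for chrom, regions in by_chrom.items()}


def find_overlaps(chrom, start, end):
    hits = []
    query(trees.get(chrom), start, end, hits)
    return hits


print(find_overlaps("chr1", 200, 300))  # peak1 -> ['gene1']
print(find_overlaps("chr2", 400, 500))  # peak2 -> []
print(find_overlaps("chr3", 456, 888))  # peak3 -> ['gene3']
```

The gain over the brute-force loop is that each query can prune subtrees instead of visiting every gene, which matters once the files have thousands of rows.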

In 2010, bedtools was published! In a single command (bedtools intersect), you can accomplish what I did with my Python script.
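To keep all the examples in Python, the sketch below writes the dummy data to two temporary tab-separated files and builds the bedtools intersect call. The -a and -b options name the two input files, and -wa/-wb ask bedtools to report the original entry from each file for every overlap. The file names and helper functions (write_bed, intersect_command) are hypothetical, and the command only runs if bedtools is found on your PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

peaks = [("chr1", 200, 300, "peak1"),
         ("chr2", 400, 500, "peak2"),
         ("chr3", 456, 888, "peak3")]
genes = [("chr1", 250, 350, "gene1"),
         ("chr2", 600, 700, "gene2"),
         ("chr3", 700, 800, "gene3")]


def write_bed(path, regions):
    """Write (chr, start, end, name) tuples as a tab-separated BED-like file."""
    path.write_text("".join(f"{c}\t{s}\t{e}\t{n}\n" for c, s, e, n in regions))


def intersect_command(peaks_bed, genes_bed):
    """Build the bedtools call; -wa/-wb keep the matching rows from both files."""
    return ["bedtools", "intersect",
            "-a", str(peaks_bed), "-b", str(genes_bed), "-wa", "-wb"]


with tempfile.TemporaryDirectory() as tmp:
    peaks_bed = Path(tmp) / "peaks.bed"
    genes_bed = Path(tmp) / "genes.bed"
    write_bed(peaks_bed, peaks)
    write_bed(genes_bed, genes)
    cmd = intersect_command(peaks_bed, genes_bed)
    if shutil.which("bedtools"):
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print(result.stdout, end="")
    else:
        print("bedtools not found on PATH; the command would be:", " ".join(cmd))
```

On the command line, this is the same as typing bedtools intersect -a peaks.bed -b genes.bed -wa -wb.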

Remember, I wrote my script in 2012, two years after bedtools was published.

The problem was I did not know this tool even existed!

As a beginner, not knowing what's out there is the price you pay. (wink, follow me on X https://x.com/tangming2005, I tweet tools and papers)

I was not alone. On Thursday, I had the pleasure of having dinner with Dr. Ting Wang. We had invited him to give a talk at our company.

He told me that in the early days, he wrote a Perl script to intersect genomic regions and found that TP53 binds to transposable elements (TEs). See his 2007 paper: https://pubmed.ncbi.nlm.nih.gov/18003932/

Of course, Ting was a formally trained PhD in bioinformatics and I am sure his Perl script is much better than my crappy one.

But this tells you how common this type of analysis is, and bedtools came to the rescue in 2010.

Later, I started to learn more about the Bioconductor ecosystem and learned the GenomicRanges package, which is the foundation for dealing with genomic intervals.

Action item: I highly recommend you learn how to use bedtools and GenomicRanges.

That's the story for today. Stay tuned for my next one.

Happy Learning!

Tommy

PS:

If you want to learn Bioinformatics, there are four ways that I can help:

  1. My free YouTube channel, Chatomics. Make sure you subscribe to it.
  2. I have many resources collected on my github here.
  3. I have been writing blog posts for over 10 years https://divingintogeneticsandgenomics.com/
  4. Lastly, I have a book called "From Cell Line to Command Line" to teach you bioinformatics.

Stay awesome!

I am a computational biologist with six years of wet lab experience and over ten years of computational experience. I will help you learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter! https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources