
Hi! I'm Tommy Tang

Why Bioinformatics is complicated


Hello Bioinformatics lovers,

I cannot believe it is already the second month of 2025!

I want to grab every opportunity to remind you that it is never too late to start learning. (I wrote this tutorial to show you how to analyze TCGA bulk RNAseq data. Enjoy!)

For those who celebrate the Lunar New Year, Happy New Year!

It is the year of the snake. My wife wrote the couplets.

Today we will talk about the complexity of bioinformatics.

Why is bioinformatics so complicated? Because biology is.

Here’s a quick example to show just how nuanced even a "simple" analysis can be.

Genes aren’t simple entities. Most genes have multiple transcripts.

Different transcripts can have a unique TSS (transcription start site), TES (transcription end site), and exon composition.
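To make this concrete, here is a minimal Python sketch of three transcripts of one gene. The transcript IDs and coordinates are made up for illustration and do not come from any real annotation. The only point is that the TSS depends on strand: on the plus strand it is the leftmost coordinate, on the minus strand it is the rightmost one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str  # hypothetical ID, not from a real annotation
    strand: str         # "+" or "-"
    start: int          # leftmost genomic coordinate (1-based, inclusive)
    end: int            # rightmost genomic coordinate (1-based, inclusive)


def tss(tx: Transcript) -> int:
    """Transcription start site: leftmost on '+', rightmost on '-'."""
    return tx.start if tx.strand == "+" else tx.end


def tes(tx: Transcript) -> int:
    """Transcription end site: rightmost on '+', leftmost on '-'."""
    return tx.end if tx.strand == "+" else tx.start


# Three made-up transcripts of the same plus-strand gene.
transcripts = [
    Transcript("tx1", "+", 1000, 9000),
    Transcript("tx2", "+", 2500, 9000),  # alternative TSS
    Transcript("tx3", "+", 1000, 7000),  # alternative TES
]

distinct_tss = sorted({tss(tx) for tx in transcripts})
distinct_tes = sorted({tes(tx) for tx in transcripts})
print(distinct_tss)  # [1000, 2500]
print(distinct_tes)  # [7000, 9000]
```

One gene, two candidate TSSs and two candidate TESs, and that is before looking at exon composition.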

This complexity cascades.

Take DNA methylation and gene expression correlation as an example. How do we even start?

If you’re using gene-level mRNA values, you might average CpG site methylation in the 1 kb upstream of the TSS. But which TSS?
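A small sketch shows how much the TSS choice matters. It assumes CpG sites are given as (position, beta value) pairs for a single sample, and the positions and beta values are invented for the example. The function names are hypothetical helpers, not part of any package.

```python
def upstream_window(tss, strand, size=1000):
    """Return the inclusive (lo, hi) interval covering `size` bp upstream of the TSS.

    Upstream means lower coordinates on the '+' strand and higher
    coordinates on the '-' strand.
    """
    if strand == "+":
        return tss - size, tss - 1
    return tss + 1, tss + size


def mean_promoter_beta(cpgs, tss, strand, size=1000):
    """Average the beta values of CpGs inside the upstream window.

    Returns None when no CpG falls in the window.
    """
    lo, hi = upstream_window(tss, strand, size)
    betas = [beta for pos, beta in cpgs if lo <= pos <= hi]
    return sum(betas) / len(betas) if betas else None


# Made-up CpG sites: (genomic position, beta value) in one sample.
cpgs = [(400, 0.75), (1200, 0.25), (1900, 0.5), (2600, 0.75)]

# Same plus-strand gene, two candidate TSSs.
print(mean_promoter_beta(cpgs, 2000, "+"))  # 0.375
print(mean_promoter_beta(cpgs, 3000, "+"))  # 0.75
```

Same gene, same sample, same CpGs, and the "promoter methylation" doubles just by picking a different TSS.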

Options abound: canonical transcripts, RefSeq transcripts, or MANE (Matched Annotation from NCBI and EMBL-EBI) transcripts. Which is best?

Or maybe you ditch gene-level mRNA values and use transcript-level data from tools like kallisto or Salmon.

With transcript-level data, you can correlate specific TSS methylation beta values to corresponding transcript mRNA values.
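As a rough sketch of that idea, the snippet below pairs each transcript's promoter beta values with its expression values across the same samples and computes a Pearson correlation. Everything here is an assumption for illustration: the transcript IDs, the four samples, and the numbers are made up, and Pearson is just one choice (a rank-based correlation such as Spearman may suit real data better).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists.

    Returns None when either input has zero variance, because the
    correlation is undefined in that case.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


# Made-up data: one value per sample, same sample order in both dicts.
promoter_beta = {
    "tx1": [0.125, 0.25, 0.5, 0.75],
    "tx2": [0.5, 0.5, 0.5, 0.5],  # no variation in methylation
}
expression = {
    "tx1": [40.0, 30.0, 20.0, 10.0],
    "tx2": [5.0, 8.0, 6.0, 7.0],
}

# One correlation per transcript, using that transcript's own TSS window.
results = {tx: pearson(promoter_beta[tx], expression[tx]) for tx in promoter_beta}
for tx, r in results.items():
    print(tx, r)
```

Here tx1 comes out strongly anti-correlated, while tx2 returns None because its methylation never changes across samples. With only a handful of samples, treat any such number with caution.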

But wait—what about CpG islands? Not all genes have them, and gene expression can be tied to CpG sites far from the TSS.

Some genes’ expression is anti-correlated with methylation at distal CpG sites, not within the 1kb upstream of the TSS.

Even for this “simple” question—DNA methylation vs. gene expression—there are countless variables, decisions, and nuances to consider.

• Biology’s inherent complexity makes bioinformatics challenging.
• Clear analysis requires informed decisions at every step.
• Resources like MANE transcripts and transcript-level quantification tools help refine results.
• Always define your biological question before diving into analysis.
• Choose annotation strategies that match your data and goals.
• Stay updated with best practices in transcriptomics and epigenomics.

My other posts that you may find helpful:

  1. 12 years ago, Tommy was naive about bioinformatics.
  2. You can not trust a boxplot with a small sample size.
  3. Genome browsers are essential tools for genomics data analysis. I introduced three of them.
  4. Survival analysis is a critical skill for bioinformaticians. I introduced you to tools and caveats when interpreting survival curves.
  5. How to Reorder Bars in ggplot2 (and What Happens Behind the Scenes)
  6. Jupyter Notebook alternative.
  7. How can I start learning bioinformatics? This is my story.
  8. Tackling Ambient RNA in Single-Cell RNA-seq Data
  9. A unix one-liner to split a multi-fasta to multiple files
  10. No, you do not always need Single-Cell RNA-seq.

Sharing is caring. If you can send this newsletter to someone who may find it helpful, that would be great!

Happy Learning!

Tommy aka Crazyhottommy

PS:

If you want to learn Bioinformatics, there are other ways that I can help:

  1. My free YouTube channel, Chatomics. Make sure you subscribe to it.
  2. I have many resources collected on my GitHub here.
  3. I have been writing blog posts for over 12 years at https://divingintogeneticsandgenomics.com/

Stay awesome!


I am a bioinformatician/computational biologist with six years of wet lab experience and over 12 years of computational experience. I will help you learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter!
