Your reference genome is lying to you

Published about 2 months ago • 2 min read

Hello Bioinformatics lovers,

Tommy here. It is Halloween, and I wrote this little story. Hope you enjoy my life lesson as much as my bioinformatics posts :)

Today, we will talk about references and other choices you make that may affect your bioinformatics analysis results.

*So many choices to make before you do the analysis*

You just picked GRCh38 for your analysis.

That one choice—before you’ve written a single line of code—

already shapes what variants you’ll find, what genes you’ll miss, and whether your results are reproducible.

Here’s what most bioinformaticians don’t realize: every reference genome version comes with tradeoffs,

and the wrong choice can reduce variant calling sensitivity and increase false positives.

The decision cascade

If you’re using GRCh37, you’ll want hs37d5 with decoy sequences to reduce false positives.

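For GRCh38, use GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz. It leaves out the ALT contigs, which otherwise make aligners that are not ALT-aware assign mapping quality zero to reads in the flanking sequences those contigs share with the primary assembly.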
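
A quick way to see which flavor of a reference you actually have is to look at the contig names in its .fai index. Here is a minimal Python sketch. It assumes the naming conventions commonly seen in GRCh38 and hs37d5 distributions (names ending in _alt or _decoy, the hs37d5 decoy contig, names starting with HLA-). The function names and the toy index lines are made up for illustration, so check them against your own file.

```python
from collections import Counter


def classify_contig(name):
    """Bucket a contig name using the naming conventions assumed above."""
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_decoy") or name == "hs37d5":
        return "decoy"
    if name.startswith("HLA-"):
        return "hla"
    return "primary_or_unplaced"


def summarize_fai(fai_text):
    """Count contig classes from the text of a .fai index.

    Only the first tab-separated column (the contig name) is used.
    """
    counts = Counter()
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        name = line.split("\t")[0]
        counts[classify_contig(name)] += 1
    return counts


# Toy index lines: names follow real conventions, but the lengths and
# offsets are illustrative values, not copied from a real index.
example_fai = "\n".join([
    "chr1\t1000\t6\t60\t61",
    "chrM\t500\t1100\t60\t61",
    "chr1_KI270762v1_alt\t300\t1700\t60\t61",
    "chrUn_KN707606v1_decoy\t200\t2100\t60\t61",
    "HLA-A*01:01:01:01\t100\t2400\t60\t61",
])

counts = summarize_fai(example_fai)
for label, n in sorted(counts.items()):
    print(label, n)
```

If the alt count is above zero and your aligner is not ALT-aware, that is a hint you may want the no-alt analysis set instead.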

But wait—T2T-CHM13 now provides the first complete human genome sequence with gapless assemblies for all chromosomes, including telomeres and centromeres that were missing from previous references.

It adds nearly 200 million base pairs of novel sequence and corrects errors in GRCh38.

And then there’s the pangenome. Pangenome-aware DeepVariant reduces variant calling errors by up to 25.5% compared to using a linear reference genome alone.

Four viable options. Each changes your results.

The methylation trap

Let’s say you’re correlating DNA methylation with gene expression.

Simple question: which region near the gene matters?

1kb upstream of TSS? 2kb? Maybe 200bp downstream?

MANE Select provides one representative transcript per protein-coding gene as a standard for clinical reporting.

But the Ensembl canonical transcript will be the MANE Select transcript if available,

or a transcript chosen by algorithm otherwise.

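Meanwhile, Bioconductor’s genes() function returns one range per gene that spans all of its annotated transcripts, so the "TSS" you read off that range is simply the most upstream transcript start,

which may not belong to the most biologically relevant isoform.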

Same gene.

Three different TSS positions.

Completely different methylation correlations.
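
To make this concrete, here is a small Python sketch of how the TSS choice changes which CpGs you end up correlating. Everything in it is hypothetical: the three TSS positions, the CpG coordinates, the 1 kb upstream / 200 bp downstream window, and the function names are assumptions for illustration, not values from any real annotation.

```python
from typing import Dict, List, Tuple


def promoter_window(tss: int, strand: str,
                    upstream: int = 1000,
                    downstream: int = 200) -> Tuple[int, int]:
    """Return an inclusive (start, end) window around a TSS.

    "Upstream" follows the direction of transcription, so the window is
    mirrored for genes on the minus strand.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def cpgs_in_window(cpg_positions: List[int],
                   window: Tuple[int, int]) -> List[int]:
    """Keep the CpG positions that fall inside the inclusive window."""
    start, end = window
    return [p for p in cpg_positions if start <= p <= end]


# Hypothetical TSS positions for one plus-strand gene under three annotations.
tss_choices: Dict[str, int] = {
    "transcript_A": 10000,
    "transcript_B": 12500,
    "transcript_C": 9200,
}

# Hypothetical CpG coordinates on the same chromosome.
cpgs = [8500, 9100, 9900, 10150, 11800, 12400, 12650]

selected = {
    label: cpgs_in_window(cpgs, promoter_window(tss, "+"))
    for label, tss in tss_choices.items()
}

for label, positions in selected.items():
    print(label, positions)
```

With these toy numbers, each TSS choice picks a different set of CpGs, so the methylation values going into the correlation are not the same data.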

Why this matters

Studies show that 1.5% of SNVs and 2.0% of indels are called differently depending on which reference genome version is used.

A notable 76.6% of these discordant variants cluster in specific problem regions that affect variant interpretation for rare and common diseases.

You’re not just analyzing data.

You’re navigating thousands of silent assumptions:

- Which transcript represents your gene

- How far “near” the TSS extends

- Whether to use chromatin interaction data for distal sites

- Which methylation positions actually matter biologically

The real skill

With ChatGPT, anyone can write code to extract sequences, calculate correlations, or run DESeq2.

The hard part? Knowing that:

- Your chosen reference might have false duplications in clinically important genes

- Default functions may pick gene coordinates by span or transcript length, not biology

- MANE Plus Clinical exists for 55 genes where the Select transcript alone isn’t sufficient to report all pathogenic variants

- Distal regulatory elements might be more important than promoter methylation

What to do

Document everything. Not just your methods—your decisions.

Which reference did you use, and why?
How did you define gene coordinates?
What assumptions did your default functions make?
Could alternative approaches change your conclusions?
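
One lightweight way to do this is to keep a machine-readable decision log next to your results. The sketch below is only one possible layout, written in Python. The AnalysisDecision class, its field names, and the example entries are my assumptions, not a standard format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AnalysisDecision:
    """One recorded decision: what was asked, what was chosen, and why."""
    question: str
    choice: str
    rationale: str
    alternatives: List[str] = field(default_factory=list)


def decisions_to_json(decisions: List[AnalysisDecision]) -> str:
    """Serialize the decision log so it can be saved with the results."""
    return json.dumps([asdict(d) for d in decisions], indent=2)


# Example entries; the choices and rationales are placeholders.
decisions = [
    AnalysisDecision(
        question="Which reference did you use, and why?",
        choice="GRCh38 no-alt analysis set",
        rationale="Aligner is not ALT-aware",
        alternatives=["hs37d5", "T2T-CHM13", "pangenome"],
    ),
    AnalysisDecision(
        question="How did you define gene coordinates?",
        choice="MANE Select TSS, 1 kb upstream to 200 bp downstream",
        rationale="One representative transcript per gene",
        alternatives=["Ensembl canonical", "gene range start"],
    ),
]

log_text = decisions_to_json(decisions)
print(log_text)
```

Committing a file like this alongside your code makes the "why" reviewable, not just the "what".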

Read the documentation. Different genome versions have different issues: ALT contigs, pseudoautosomal regions (PARs), mitochondrial sequences, and coordinate systems all vary.
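
A simple sanity check before mixing files is to confirm which build and naming style each one uses. This Python sketch guesses the build from the length of chromosome 1 (249,250,621 bp in GRCh37 and 248,956,422 bp in GRCh38) and reports the naming of chromosome 1 and the mitochondrial contig. The function name and the candidate mitochondrial names are assumptions for illustration. It only covers these two builds and is no substitute for reading the file headers.

```python
from typing import Dict, Optional

# Known chromosome 1 lengths for the two builds covered by this sketch.
CHR1_LENGTHS = {
    249250621: "GRCh37",
    248956422: "GRCh38",
}

# Mitochondrial contig names to look for (an assumed, non-exhaustive list).
MITO_NAMES = ("chrM", "MT", "chrMT", "M")


def describe_reference(contig_lengths: Dict[str, int]) -> Dict[str, object]:
    """Summarize build guess and naming style from contig name -> length."""
    uses_chr_prefix = "chr1" in contig_lengths
    chr1_len: Optional[int] = contig_lengths.get(
        "chr1", contig_lengths.get("1")
    )
    mito_name = next((n for n in MITO_NAMES if n in contig_lengths), None)
    return {
        "build_guess": CHR1_LENGTHS.get(chr1_len, "unknown"),
        "chr_prefix": uses_chr_prefix,
        "mito_name": mito_name,
    }


# Two toy inputs, e.g. parsed from a BAM header and an annotation file.
from_alignment = {"1": 249250621, "MT": 16569}
from_annotation = {"chr1": 248956422, "chrM": 16569}

a = describe_reference(from_alignment)
b = describe_reference(from_annotation)
print(a)
print(b)

if a != b:
    print("Mismatch: check build and contig naming before combining these files")
```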

Stay current. T2T-CHM13 unlocks complex genomic regions—centromeres, telomeres, and segmental duplications—that were previously invisible.

Pangenome references can reduce reference bias that affects variant calling, particularly in populations underrepresented in the linear reference.

Bottom line

Bioinformatics isn’t hard because of code.

It’s hard because every analysis is built on decisions most people never notice making.

The defaults in your favorite package? Someone chose those.

The reference genome everyone uses? It reflects specific tradeoffs.

Master the craft by making decisions visible. Question the defaults.

Understand where your annotations come from.

Because the truth isn’t in the code. It’s in the choices that came before you ever opened R.

Did this change how you think about your next analysis?

Hit reply and let me know what assumptions you’re questioning now.

Happy Learning!

Tommy aka crazyhottommy

PS:

If you want to learn Bioinformatics, there are other ways that I can help:

My free YouTube channel, Chatomics. Make sure you subscribe to it.
I have collected many resources on my GitHub.
I have been writing blog posts for over 10 years: https://divingintogeneticsandgenomics.com/

Stay awesome!

Chatomics! — The Bioinformatics Newsletter