
Hi! I'm Tommy Tang

How Sample Naming Wastes Millions—and What Bioinformatics Can Do About It


Hello Bioinformatics lovers,

Tommy here. I was interviewed by Nature yesterday about best practices for using spreadsheets and naming files.

P.S.: This is a must-read for any scientist (both wet and dry): https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989

Sample naming turns out to be a big problem for bioinformaticians.


How many hours do bioinformaticians lose matching sample IDs across assays?
Too many.
And every one of them is avoidable.

Let’s talk about why this keeps happening—and how we can stop it.


The Chaos Begins with a Name

Two cell lines, five different labels:

  • MCF7
  • MCF-7
  • MDA-MB-361
  • MDA_MB_361
  • MDAMB361

And that’s within the same team.
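
One stopgap is to collapse spelling variants into a comparison key before matching. Here is a minimal sketch in Python, assuming hyphens, underscores, spaces, and case carry no meaning in your cell line names. The function name `normalize_cell_line` is made up for illustration.

```python
import re


def normalize_cell_line(name: str) -> str:
    """Collapse spelling variants to one comparison key.

    Assumption: punctuation and case are not meaningful in these names.
    """
    return re.sub(r"[^A-Za-z0-9]", "", name).upper()


labels = ["MCF7", "MCF-7", "MDA-MB-361", "MDA_MB_361", "MDAMB361"]
keys = {label: normalize_cell_line(label) for label in labels}
unique_keys = sorted(set(keys.values()))
print(unique_keys)  # ['MCF7', 'MDAMB361']
```

Use the key for matching only. It will happily merge two different names that differ only by punctuation, so it does not replace a registry.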


Then Come the Timepoints

Add time, treatment, or dose… and now you’ve got:

  • MCF7_24h
  • MCF7_48h
  • MCF7_resistant
  • MCF7_combo_dose2

And this is just one dataset.


The Multi-Omics Nightmare

Team A runs RNA-seq.
Team B runs proteomics.
Each with their own naming system.
No coordination. No UUIDs.

Then someone says:
“Let’s integrate the data!”
You sigh—and open a spreadsheet that will ruin your evening.
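
For contrast, here is roughly what integration could look like if both teams' sample sheets carried a shared UUID column. This is a sketch under that assumption: the field names (`sample_uuid`, `label`), the UUID values, and the helper `pair_by_uuid` are all made up for illustration.

```python
# Hypothetical sample sheets; UUID values and field names are made up.
rna_seq = [
    {"sample_uuid": "0b9f6c1e-5a3d-4c57-9a64-2f1d7e8a4b10", "label": "MCF7_24h_A"},
    {"sample_uuid": "7c2e4f90-1b6a-4d2e-8f35-9a0c3b5d6e21", "label": "MCF7_48h_A"},
]
proteomics = [
    {"sample_uuid": "0b9f6c1e-5a3d-4c57-9a64-2f1d7e8a4b10", "label": "MCF-7_T24"},
    {"sample_uuid": "e4a1d8b3-6f27-4c9b-b1d0-5c8e2f7a9d32", "label": "MDAMB361_24"},
]


def pair_by_uuid(sheet_a, sheet_b):
    """Return label pairs that share a UUID, plus UUIDs seen in only one sheet."""
    labels_a = {row["sample_uuid"]: row["label"] for row in sheet_a}
    labels_b = {row["sample_uuid"]: row["label"] for row in sheet_b}
    shared = [(labels_a[u], labels_b[u]) for u in labels_a if u in labels_b]
    unmatched = sorted(set(labels_a) ^ set(labels_b))
    return shared, unmatched


shared, unmatched = pair_by_uuid(rna_seq, proteomics)
print(shared)          # [('MCF7_24h_A', 'MCF-7_T24')]
print(len(unmatched))  # 2
```

The labels can stay as messy as each team likes. The join never looks at them.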


Sound Familiar?

You spend 3 hours cleaning it.
You're still unsure if MCF7_24h_A is the same as MCF-7_T24.
Then you find out…
The same sample was sequenced twice, under two different names, in the same assay.

No one noticed.
Resources wasted.
Trust gone.
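
A cheap check can at least flag suspicious pairs before they reach the sequencer. Here is a rough sketch, assuming names look like `<cell line>_<free text>`, which may not hold for your data. The helpers `candidate_key` and `flag_possible_duplicates` are hypothetical, and the grouping rule is a crude heuristic, not a proof of identity.

```python
import re
from collections import defaultdict


def candidate_key(sample_name: str) -> tuple:
    """Crude grouping key: normalized first token plus any numbers after it.

    Assumption: the first underscore-separated token is the cell line.
    """
    first, _, rest = sample_name.partition("_")
    cell_line = re.sub(r"[^A-Za-z0-9]", "", first).upper()
    numbers = tuple(re.findall(r"\d+", rest))
    return (cell_line, numbers)


def flag_possible_duplicates(sample_names):
    """Group names by candidate_key and return groups with more than one name."""
    groups = defaultdict(list)
    for name in sample_names:
        groups[candidate_key(name)].append(name)
    return [names for names in groups.values() if len(names) > 1]


assay_samples = ["MCF7_24h_A", "MCF-7_T24", "MCF7_48h", "MCF7_resistant"]
suspects = flag_possible_duplicates(assay_samples)
print(suspects)  # [['MCF7_24h_A', 'MCF-7_T24']]
```

This only produces candidates for a human to review. True replicates (say, `MCF7_24h_A` and `MCF7_24h_B`) get flagged too, and confirming that two names are the same sample still needs the wet lab records.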


This Isn’t Just a Small Lab Problem

Even large-scale efforts like TCGA, pharma pipelines, and multi-million dollar projects suffer from mislabeling and duplication.

Why?
Because data pipelines are only as good as the names they start with.


The Fix Is Boring—but Powerful

  • Assign a UUID (universally unique identifier) to every biospecimen
  • Centralized sample registry
  • Enforced naming conventions
  • Robust sample tracking
  • Communication between wet lab and data teams
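
The first three items can be sketched in a few lines of Python with the standard `uuid` module. Everything here is an assumption for illustration: the naming convention in `NAME_PATTERN`, the `SampleRegistry` class, and the in-memory dictionary standing in for a real database or LIMS.

```python
import re
import uuid

# Hypothetical convention: <CELLLINE>_<hours>h_<replicate>, e.g. MCF7_24h_A
NAME_PATTERN = re.compile(r"^[A-Z0-9]+_\d+h_[A-Z]$")


class SampleRegistry:
    """In-memory stand-in for a central sample registry."""

    def __init__(self):
        self._label_to_uuid = {}

    def register(self, canonical_name: str) -> str:
        """Validate the name, then issue a UUID for a new biospecimen."""
        if not NAME_PATTERN.match(canonical_name):
            raise ValueError(f"name breaks the convention: {canonical_name!r}")
        if canonical_name in self._label_to_uuid:
            raise ValueError(f"already registered: {canonical_name!r}")
        sample_id = str(uuid.uuid4())
        self._label_to_uuid[canonical_name] = sample_id
        return sample_id

    def add_alias(self, sample_id: str, alias: str) -> None:
        """Attach a team-specific label to an existing UUID."""
        if sample_id not in self._label_to_uuid.values():
            raise KeyError(f"unknown sample UUID: {sample_id}")
        existing = self._label_to_uuid.get(alias)
        if existing is not None and existing != sample_id:
            raise ValueError(f"{alias!r} already points to another sample")
        self._label_to_uuid[alias] = sample_id

    def lookup(self, label: str) -> str:
        """Return the UUID for any registered name or alias."""
        return self._label_to_uuid[label]


registry = SampleRegistry()
sample_id = registry.register("MCF7_24h_A")
registry.add_alias(sample_id, "MCF-7_T24")  # label used by another team
same_sample = registry.lookup("MCF7_24h_A") == registry.lookup("MCF-7_T24")
print(same_sample)  # True
```

The registry refuses names that break the convention and refuses to point one label at two samples. Those two checks are what keep the duplicates out.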

This isn’t rocket science.
It’s just science done right.


In a Small Company?

You can fix this in a day.

In a Large Company?

It’s harder.
Siloed teams. Legacy systems. Lack of communication.

But fixing it will save millions per year.
And preserve trust in your data.


Final Thought:

Bioinformatics isn’t janitorial work.
It’s science.
But right now, we're wasting time cleaning up names like MCF7, MCF-7, and MCF_7.

That’s not where our talent belongs.


Takeaways:

  • Bad naming = wasted hours and broken analyses
  • Use UUIDs to align assays across teams
  • Communication is infrastructure
  • Let bioinformaticians do real science, not name-guessing

If you're spending time cleaning up naming errors, it's not your fault.
But it is time to demand better.

Start with one rule:
Every sample gets a universal name.

Other posts that you may find useful

  1. chatomics! New youtube video: How to calculate partial correlation controlling cancer types
  2. Deep neural networks (DNN) are just glorified linear models?
  3. ggfortify::autoplot() for your PCA plot may not be what you want, here are the details
  4. I am never a handyman, but when your wife mandates you to repair the screen, you learn how to do it. Here is how you can learn anything, including bioinformatics.
  5. Here is how I would start learning bioinformatics: A roadmap.
  6. How to organize your bioinformatics project without chaos.
  7. Bioinformatics is hard because of decisions. Thousands of tiny choices that shape your results. which CpG, reference, isoform.
  8. Sample swaps are the silent killer in bioinformatics.
  9. Ever wondered why analyzing RNA-seq data feels like walking through a fog with 20,000 dimensions?
  10. You can not do Bioinformatics with hard cutoffs and thresholds. what's your log2FC cutoff?
  11. You’re analyzing 10x Genomics single-cell RNA-seq and notice lots of intronic reads. Let’s unpack why introns show up—and why they matter. 🧵

P.P.S.:

If you want to learn Bioinformatics, there are three ways that I can help:

  1. My free YouTube channel, Chatomics. Make sure you subscribe to it.
  2. I have many resources collected on my GitHub.
  3. I have been writing blog posts for over 10 years: https://divingintogeneticsandgenomics.com/

Stay awesome!


I am a bioinformatician/computational biologist with six years of wet lab experience and over 12 years of computational experience. I will help you learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter!
