MCF7 or MCF-7? Your data doesn't care

Published 2 months ago • 2 min read

Hello Bioinformatics lovers,

Tommy here. It is already the middle of May 2026! Are you overwhelmed by how fast AI is advancing?

openclaw, hermes, paperclip, etc. (if you do not know them, Google them or ask ChatGPT).

But let's work on some fundamentals for Bioinformatics first.

MCF7. MCF-7. MDA-MB-361. MDA_MB_361. MDAMB361.

That’s two cell lines, one team, five names. Now add timepoints and conditions: MCF7_24h, MCF7_48h, MCF7_resistant, MCF7_combo_dose2. And you haven’t opened the second dataset yet.

This is the bioinformatics tax nobody puts in a grant budget. And it’s bigger than you think.

The hidden cost

You spend three hours cleaning a sample sheet and still get it wrong. RNA-seq comes from Team A with one naming logic, proteomics from Team B with another, and someone in the kickoff meeting says “let’s do multi-omics.”

The worst version of this: a sample gets sequenced twice. Same cell line, different name, nobody catches it.

Reagents burned. Sequencing slots burned. And once you find one duplicate, you start wondering what else is wrong — which is the real damage. You lose trust in the data.
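One cheap way to surface suspects before anything gets re-sequenced is to collapse every name to a normalized key and flag keys reached by more than one spelling. This is a minimal sketch, not a fix: the `normalize_name` and `find_name_collisions` helpers are hypothetical, and the rule (uppercase, drop anything that is not a letter or digit) is an assumption that can also merge names that are genuinely different, so treat the output as a list to review by hand.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Uppercase the name and drop everything that is not a letter or digit."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def find_name_collisions(names):
    """Group distinct raw names that collapse to the same normalized key."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize_name(name)].add(name)
    # Keep only keys that were reached by more than one distinct spelling.
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


sample_sheet = ["MCF7", "MCF-7", "MDA-MB-361", "MDA_MB_361", "MDAMB361"]
collisions = find_name_collisions(sample_sheet)
for key, spellings in collisions.items():
    print(key, "<-", spellings)
```

On the five names from the top of this post, it reports two groups: the MCF7 spellings and the MDA-MB-361 spellings.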

This isn’t a small-lab problem. The NCI Genomic Data Commons built its entire biospecimen model around UUIDs precisely because TCGA-scale projects can’t survive on team-invented IDs. ISBER’s biobank Best Practices and the ISBT 128 standard exist for the same reason.

What actually works

Assign a UUID upfront, at biospecimen registration, before any assay touches it. Link the UUID to source, metadata, and assay history. Every downstream file inherits it.
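Here is a rough sketch of what registration-time UUID assignment could look like. It uses Python's standard `uuid` module; the `Biospecimen` and `SampleRegistry` classes, their fields, and the file-naming convention are all made up for illustration, and a real registry would live in a database rather than in memory.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Biospecimen:
    source: str        # e.g. cell line or donor (illustrative field)
    metadata: dict     # free-form sample metadata
    # The UUID is generated once, at registration, and never changes.
    sample_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    assay_history: list = field(default_factory=list)


class SampleRegistry:
    """Toy in-memory registry; a stand-in for a real centralized system."""

    def __init__(self):
        self._by_uuid = {}

    def register(self, source, **metadata):
        specimen = Biospecimen(source=source, metadata=metadata)
        self._by_uuid[specimen.sample_uuid] = specimen
        return specimen.sample_uuid

    def record_assay(self, sample_uuid, assay, output_file):
        # Raises KeyError if the UUID was never registered.
        self._by_uuid[sample_uuid].assay_history.append(
            {"assay": assay, "file": output_file}
        )

    def get(self, sample_uuid):
        return self._by_uuid[sample_uuid]


registry = SampleRegistry()
sample_id = registry.register("MCF7", treatment="DMSO", timepoint_h=24)
# Downstream files carry the UUID in their name (an assumed convention).
registry.record_assay(sample_id, "RNA-seq", f"{sample_id}.fastq.gz")
print(registry.get(sample_id))
```

Recording an assay against an unregistered UUID fails loudly with a `KeyError`, so no assay output can exist without a registered specimen behind it.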

Concrete practices that hold up:

One source of truth for IDs. A centralized sample registry. Not a shared spreadsheet.
UUIDs as the primary key. Human-readable names can exist for benchwork, but they’re aliases, never the join key. The cual-id approach — UUID for computers, short human-friendly code for tubes — is a clean pattern.
Enforce naming schemas at submission. Validate before data lands in the system, not after.
Treat cross-team communication as part of the protocol. If RNA-seq and proteomics teams have never agreed on a sample ID, you don’t have a multi-omics project. You have two projects with overlapping samples.
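To make the "validate at submission" and "aliases are never the join key" points concrete, here is a small sketch. The naming schema in `ALIAS_PATTERN`, the `validate_submission` helper, and the toy sample sheets are all assumptions for illustration; your own schema will differ. Each alias is checked against the schema and the registry first, and the two teams' tables are then joined on the UUID only.

```python
import re
import uuid

# Hypothetical naming schema: <CELLLINE>_<hours>h_rep<N>, e.g. MCF7_24h_rep1.
ALIAS_PATTERN = re.compile(r"[A-Z0-9]+_\d+h_rep\d+")


def validate_submission(rows, alias_to_uuid):
    """Split rows into accepted (tagged with a UUID) and rejected (with a reason)."""
    accepted, rejected = [], []
    for row in rows:
        alias = row["alias"]
        if not ALIAS_PATTERN.fullmatch(alias):
            rejected.append((alias, "does not match naming schema"))
        elif alias not in alias_to_uuid:
            rejected.append((alias, "alias is not registered"))
        else:
            accepted.append({**row, "sample_uuid": alias_to_uuid[alias]})
    return accepted, rejected


# Toy registry: alias -> UUID, assigned at biospecimen registration.
alias_to_uuid = {
    "MCF7_24h_rep1": str(uuid.uuid4()),
    "MCF7_48h_rep1": str(uuid.uuid4()),
}

rna_rows = [
    {"alias": "MCF7_24h_rep1", "file": "rna_01.fastq.gz"},
    {"alias": "MCF-7_24h", "file": "rna_02.fastq.gz"},   # breaks the schema
]
prot_rows = [
    {"alias": "MCF7_24h_rep1", "file": "prot_01.raw"},
    {"alias": "MCF7_72h_rep1", "file": "prot_02.raw"},   # never registered
]

rna_ok, rna_bad = validate_submission(rna_rows, alias_to_uuid)
prot_ok, prot_bad = validate_submission(prot_rows, alias_to_uuid)

# Join the two assays on the UUID, never on the human-readable alias.
rna_by_uuid = {row["sample_uuid"]: row for row in rna_ok}
prot_by_uuid = {row["sample_uuid"]: row for row in prot_ok}
shared = sorted(set(rna_by_uuid) & set(prot_by_uuid))

print("rejected:", rna_bad + prot_bad)
for sample_uuid in shared:
    print(sample_uuid, rna_by_uuid[sample_uuid]["file"], prot_by_uuid[sample_uuid]["file"])
```

The rejected rows go back to the submitting team with a reason attached, which is much cheaper than discovering the mismatch during integration.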

In a small org this is a Friday afternoon. In a big org it’s a years-long fight against legacy systems and team silos. It’s still worth the fight, because the alternative is a senior scientist matching MCF-7 to MCF7 by hand.

The takeaway

Sample tracking isn’t janitorial. It’s the substrate everything else sits on — your DE analysis, your multi-omics integration, your clinical correlations, your reproducibility. Get the IDs right and the rest of the stack stops lying to you.

If your team doesn’t have a UUID policy, that’s the first ticket to file Monday morning.

Happy Learning!

Tommy aka crazyhottommy

What’s the worst sample-naming story you’ve inherited? Hit reply — I read every one.

PS:

If you want to learn Bioinformatics, there are four ways that I can help:

My free Chatomics YouTube channel. Make sure you subscribe to it.
I have collected many resources on my GitHub.
I have been writing blog posts for over 10 years at https://divingintogeneticsandgenomics.com/
Lastly, I post daily on LinkedIn about bioinformatics and AI. Make sure to follow me if you do not want to miss any posts.

Stay awesome!

PPS:

My Nextflow Summit talk on Reproducible Bioinformatics is now on YouTube: https://www.youtube.com/watch?v=c9QJm8N67BAT

The slides can be found here: https://divingintogeneticsandgenomics.com/talk/2026-nextflow-summit-boston/

Chatomics! — The Bioinformatics Newsletter