
Chatomics! — The Bioinformatics Newsletter

Why most single-cell annotation benchmarks are missing the point


Hello Bioinformatics lovers,

Tommy here. I made a 40-minute video showing how to do RNA-seq analysis end to end.

Watch it here!

I was recently interviewed by Pure Storage: Data Cleaning ‘Janitorial Work’ is Key to Unlocking Life Sciences Breakthroughs

Today, we will talk about single-cell cell type annotation.

Your model might be accurate — but is it biologically meaningful?

Everyone’s benchmarking single-cell annotation models these days.

You train on a million cells. You annotate a new dataset.

Everything looks great.

But here’s the uncomfortable truth:

Your prediction is only as good as your reference.

What’s the ground truth, really?

We don’t even have a universal definition of a “cell type.”

It’s a human-imposed label, a convenient shorthand.

In reality, many cells exist in a continuum — not discrete boxes.

Take CD8 T cells, for example. They behave differently in healthy tissue than in tumors.

You’ll find states like:

  • Progenitor exhausted
  • Central memory
  • Effector memory
  • Naive
  • Terminally exhausted

Each state has unique transcriptional signatures — and biological implications.


Here’s the issue:

If your model is trained on millions of cells without state-level annotation,

it can’t predict these nuanced states in a new dataset.

And those states are exactly what matter in biology.

They drive immune responses, therapy outcomes, and disease progression.
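To make the point concrete, here is a minimal sketch, not a real annotation pipeline. It uses a toy nearest-centroid classifier on made-up scores for three genes (CD8A, TCF7, PDCD1). The numbers, the label names such as `CD8_Tex`, and the helper functions `fit_centroids` and `predict` are all assumptions for illustration. The same reference cells are labeled twice, once by broad type and once by state, and the model can only ever return the vocabulary it was trained on.

```python
from collections import defaultdict
from math import dist


def fit_centroids(cells, labels):
    """Average the expression vectors of each label into one centroid."""
    groups = defaultdict(list)
    for vec, lab in zip(cells, labels):
        groups[lab].append(vec)
    return {
        lab: tuple(sum(col) / len(vecs) for col in zip(*vecs))
        for lab, vecs in groups.items()
    }


def predict(centroids, cell):
    """Return the label whose centroid is closest to the cell."""
    return min(centroids, key=lambda lab: dist(centroids[lab], cell))


# Toy reference cells: (CD8A, TCF7, PDCD1) scores, invented for illustration.
reference = [
    (5.0, 4.0, 0.2),  # naive-like CD8
    (5.2, 3.8, 0.1),  # naive-like CD8
    (4.9, 2.0, 3.0),  # progenitor exhausted-like CD8
    (5.1, 0.3, 4.8),  # terminally exhausted-like CD8
    (0.1, 3.5, 0.3),  # CD4
    (0.2, 3.0, 0.5),  # CD4
]

# Two annotations of the same cells: broad type vs. state level.
coarse_labels = ["CD8", "CD8", "CD8", "CD8", "CD4", "CD4"]
state_labels = ["CD8_naive", "CD8_naive", "CD8_Tpex", "CD8_Tex", "CD4", "CD4"]

coarse_model = fit_centroids(reference, coarse_labels)
state_model = fit_centroids(reference, state_labels)

# A query cell that looks terminally exhausted (low TCF7, high PDCD1).
query = (5.0, 0.4, 4.5)

coarse_pred = predict(coarse_model, query)  # limited to "CD8" or "CD4"
state_pred = predict(state_model, query)    # can name the state

print(coarse_pred, state_pred)
```

Both models see exactly the same cells. Only the one trained on state-level labels can tell you the query cell looks exhausted. Adding more cells to the coarse reference would not change that.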

So what’s the point of a model that just says “CD4” or “CD8”?

If your predictions stop at the broad categories, you’re missing the biology that truly matters.
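Here is a small sketch of how a benchmark can hide this. The labels and predictions below are invented, and the `lineage_state` naming convention and the helpers `accuracy` and `to_lineage` are assumptions for illustration. The same predictions are scored twice: once after collapsing every label to its broad lineage, and once at the state level.

```python
def accuracy(pred, truth):
    """Fraction of positions where the prediction matches the truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def to_lineage(label):
    """Collapse a 'lineage_state' label to its broad lineage, e.g. 'CD8'."""
    return label.split("_")[0]


# Invented ground truth and predictions for five cells.
truth_states = ["CD8_naive", "CD8_Tex", "CD8_Tpex", "CD4_Treg", "CD4_naive"]
pred_states = ["CD8_naive", "CD8_naive", "CD8_Tex", "CD4_naive", "CD4_naive"]

# Score at the broad lineage level (what many benchmarks report).
lineage_acc = accuracy(
    [to_lineage(p) for p in pred_states],
    [to_lineage(t) for t in truth_states],
)

# Score at the state level (what the biology depends on).
state_acc = accuracy(pred_states, truth_states)

print(f"lineage accuracy: {lineage_acc:.2f}")  # 1.00
print(f"state accuracy:   {state_acc:.2f}")    # 0.40
```

In this toy case every cell gets the right lineage, so the coarse score is perfect, while three of the five states are wrong.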

Instead, we should be building models that understand states, not just types.

I highly recommend exploring ProjecTILs — a tool that does this elegantly.


Further reading

If you’re serious about rethinking how we define cell identity, start here:

  1. 🧬 On cell types and cell states — Matthew Bernstein
  2. The evolving concept of cell identity in the single-cell era
  3. A periodic table of cell types

The takeaway

We build models to understand biology — not to show off scale.

If your model doesn’t bring new biological insight, it doesn’t matter how many cells it was trained on.

Deep domain knowledge isn’t optional.

It’s what separates real understanding from mere computation.

Happy Learning!

Tommy aka crazyhottommy

PS:

If you want to learn Bioinformatics, there are other ways that I can help:

  1. My free YouTube channel, Chatomics. Make sure you subscribe to it.
  2. I have collected many resources on my GitHub.
  3. I have been writing blog posts for over 10 years at https://divingintogeneticsandgenomics.com/

Stay awesome!


Why Subscribe?

  ✅ Curated by Tommy Tang, a Director of Bioinformatics with 100K+ followers across LinkedIn, X, and YouTube
  ✅ No fluff, just deep insights and working code examples
  ✅ Trusted by grad students, postdocs, and biotech professionals
  ✅ 100% free
