Why Subscribe?
✅ Curated by Tommy Tang, a Director of Bioinformatics with 100K+ followers across LinkedIn, X, and YouTube
✅ No fluff, just deep insights and working code examples
✅ Trusted by grad students, postdocs, and biotech professionals
✅ 100% free
Hello Bioinformatics lovers,

How is your first week of 2026? We are all busy setting goals, but the real power is in executing them daily (e.g., I have been posting on LinkedIn for 693 consecutive days). The math is simple: you hit your daily goal --> you hit your weekly goal --> you hit your monthly goal --> you hit your yearly goal --> you hit your 5-year goal --> you hit your 10-year goal. Life has to be intentional. If you have an idea of where you want to be in 10 years and you show up every day, you will get there. For sure.

Today, we will talk about machine learning on omics data.

You built a classifier. The accuracy is 98%. The p-value is 0.001. Your PI is excited. But here's the uncomfortable truth: it's probably worthless. Not because you're a bad scientist, but because high-dimensional omics data has a nasty habit of whispering back exactly what you want to hear.

The curse that everyone ignores

With thousands of genes and only hundreds of samples, you can always find a pattern, even in random noise. This isn't a bug; it's mathematics. Given enough features, you can draw a hyperplane that separates any two classes. The question isn't "can I fit this data?" It's "will this work on data I haven't seen?" And that's where most omics classifiers fall apart. (The first sketch after this section demonstrates it on pure noise.)

The validation mistake costing you years

You're using cross-validation. Good start. But answer this honestly: did you select your features on the full dataset before splitting into CV folds? If yes, you've leaked information. Your model has peeked at the test set through the features you chose. The performance you're seeing? Inflated. Every time.

The fix is nested cross-validation:
• Outer loop: estimates the true error
• Inner loop: tunes the hyperparameters

It's slower. It's more complex. But it's the only honest answer. (See the second sketch after this section for what the nested setup looks like in code.)

The enemy hiding in your metadata

Regularization helps: LASSO, Ridge, and Elastic Net all penalize complexity. But they can't save you from confounders. Batch effects. Sequencing platform. Age. Study site. If any of these correlate with your outcome, they will fake predictive power beautifully. Random CV splits won't catch this. Only external validation will, preferably on data from a completely different cohort, a different lab, a different year. If your model hasn't been tested on independent data, you don't have a model. You have a hypothesis. (The third sketch after this section shows two cheap cohort-aware checks.)
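Here is a minimal sketch of the leakage problem, using scikit-learn on simulated data. None of this is from the original post; the sample size, feature count, k, and classifier are arbitrary illustrative choices.

```python
# Minimal sketch: selecting features on the FULL dataset before CV leaks
# information and inflates accuracy, even when X and y are pure random noise.
# All sizes and parameters are arbitrary illustrative choices.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # 100 "samples" x 5000 "genes" of pure noise
y = rng.integers(0, 2, size=100)   # random binary labels, no real signal

# WRONG: pick the 20 most "discriminative" genes using all samples, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# RIGHT: put feature selection inside a pipeline so it is refit on each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_acc = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_acc:.2f}")   # usually far above 0.5 on noise
print(f"honest CV accuracy: {honest_acc:.2f}")  # hovers around 0.5, as it should
```

And a minimal nested cross-validation sketch, again assuming scikit-learn. The pipeline, the hyperparameter grid, and the fold counts are placeholders you would adapt to your own data, not a recommendation.

```python
# Minimal nested CV sketch: the inner loop (GridSearchCV) tunes hyperparameters,
# the outer loop estimates generalization error. Grid values, the estimator,
# and fold counts are illustrative placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5000))   # replace with your expression matrix
y = rng.integers(0, 2, size=100)   # replace with your labels

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),   # feature selection lives INSIDE the CV
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])
param_grid = {"select__k": [10, 50, 200], "clf__C": [0.01, 0.1, 1.0]}

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # tunes hyperparameters
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)   # estimates true error

search = GridSearchCV(pipe, param_grid, cv=inner, scoring="roc_auc")
nested_auc = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(f"nested CV AUC: {nested_auc.mean():.2f} +/- {nested_auc.std():.2f}")
```

Finally, a sketch of two cheap cohort-aware checks you can run before you even get to external validation: test whether the outcome is associated with the batch/site label, and evaluate with leave-one-cohort-out splits instead of random folds. The `cohort` label and all data here are hypothetical placeholders, and this is not a substitute for a truly independent cohort.

```python
# Minimal sketch of two cheap confounder checks. NOT a substitute for validation
# on a truly independent cohort; it only flags obvious problems.
# `cohort` is a hypothetical per-sample batch/site label.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 2000))                         # replace with your data
y = rng.integers(0, 2, size=120)
cohort = np.repeat(["site_A", "site_B", "site_C"], 40)   # e.g. study site, batch, platform

# 1) Is the outcome associated with the cohort? A quick contingency-table check.
table = np.array([[np.sum((cohort == c) & (y == label)) for label in (0, 1)]
                  for c in np.unique(cohort)])
print("cohort vs outcome chi-squared p-value:", chi2_contingency(table)[1])

# 2) Leave-one-cohort-out CV: every test fold comes from a cohort the model never saw.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
aucs = cross_val_score(model, X, y, groups=cohort, cv=LeaveOneGroupOut(), scoring="roc_auc")
print("per-held-out-cohort AUC:", np.round(aucs, 2))
```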
The replication crisis

Here's a pattern you've seen before. The usual suspects:
• Data leakage from improper CV
• Confounding baked into the signal
• Overfitting mistaken for discovery
• Cherry-picked results across many attempts

The uncomfortable truth? Most of these aren't fraud. They're statistical illiteracy dressed up as innovation. I have collected a couple of papers on this in my repo, https://github.com/crazyhottommy/machine-learning-resource/tree/master. Make sure you read them.

The 2% improvement trap

Your new method beats the baseline by 2%. Exciting, right? Not unless you can show that the improvement holds across multiple independent datasets with proper statistical testing (a minimal sketch of that kind of paired comparison is at the end of this issue). One dataset, one comparison? That's an anecdote, not evidence.

What actually works

The best omics classifiers I've seen share common traits:
• Simpler than you'd expect (fewer features, not more)
• Validated exhaustively (multiple external cohorts)
• Transparent about limitations (honest about where they fail)
• Properly controlled (batch effects identified and addressed)

Complexity is easy. Simplicity that generalizes is hard.

The bottom line

Omics gives you enough rope to hang yourself. Thousands of features mean thousands of ways to find patterns that don't exist. The discipline isn't in building models; it's in validating them honestly enough to know when they're real. Your data will always tell you a story. Your job is to know when it's lying.
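If you do have paired performance numbers for the baseline and the new method on several independent datasets, a paired non-parametric test is a reasonable first check. A minimal sketch below; the AUC values are made-up placeholders, not real results.

```python
# Minimal sketch: compare a baseline and a "new" method across independent
# datasets with a paired, non-parametric test. The AUCs are made-up placeholders.
import numpy as np
from scipy.stats import wilcoxon

baseline   = np.array([0.81, 0.74, 0.88, 0.69, 0.77, 0.83])   # one AUC per dataset
new_method = np.array([0.83, 0.75, 0.87, 0.72, 0.78, 0.86])

diff = new_method - baseline
stat, p = wilcoxon(new_method, baseline)   # paired Wilcoxon signed-rank test
print(f"median improvement: {np.median(diff):+.3f}")
print(f"Wilcoxon signed-rank p-value: {p:.3f}")
# A 2% bump on one dataset is an anecdote; a consistent shift across independent
# datasets, with an effect size and an honest p-value, starts to be evidence.
```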
Tommy aka crazyhottommy

PS: If you want to learn Bioinformatics, there are other ways that I can help. Please forward this newsletter to your friends if you find it helpful.

Stay awesome!