Why Subscribe?✅ Curated by Tommy Tang, a Director of Bioinformatics with 100K+ followers across LinkedIn, X, and YouTube✅ No fluff—just deep insights and working code examples✅ Trusted by grad students, postdocs, and biotech professionals✅ 100% free
Hello bioinformatics lovers, Tommy here.

Two months of 2026 have already passed! Are you learning new skills? Today, we will talk about ML.

High accuracy doesn't mean your model learned biology. It might have learned your batch effects, your sequencing center, or your hospital's imaging artifacts.

Here's a famous example. Researchers trained a model to classify wolves vs. dogs. High accuracy. Then they used LIME to explain its predictions and found the model was classifying snow in the background, not the animals. Snow = wolf. Grass = dog.

This happens in our field constantly. Three examples:

COVID-19 X-ray classifiers learned hospital-specific artifacts: image contrast, patient positioning (AP vs. PA), even source hospital metadata. High accuracy. Wrong reasons.

Deep learning for cancer prognosis on TCGA. Howard, Kather & Pearson (Cancer Cell, 2023) showed that without site-preserved cross-validation, models trained on TCGA histology data can learn to infer the submitting institution rather than tumor biology, inflating accuracy that won't replicate elsewhere.

TCGA germline exome data (Rasnic et al., BMC Cancer, 2019) carries severe batch effects by sequencing center: up to 30% variability in called germline variants, including in BRCA1 and KRAS. A model trained without batch correction is learning Broad vs. Baylor, not cancer biology.

The pattern is always the same: your model finds the easiest signal that predicts your labels. If technical variables correlate with your outcome, it will find that shortcut before it finds the biology.

Four ways to protect yourself:
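To make the site-preserved cross-validation idea from the TCGA example concrete, here is a minimal sketch in plain Python. It is an illustration of the general principle (every site is held out as a whole, so the model is never tested on a site it trained on), not the procedure used by Howard, Kather & Pearson. The function name `site_preserved_folds`, the sample sheet, and the "WashU" and "MDA" site labels are hypothetical and exist only for this example.

```python
from collections import defaultdict


def site_preserved_folds(sites, n_splits=3):
    """Yield (train_idx, test_idx) pairs in which each site is held out whole.

    `sites` holds one label per sample (e.g. the submitting institution).
    Sites are assigned greedily, largest first, to the currently smallest
    fold so that fold sizes stay roughly balanced.
    """
    by_site = defaultdict(list)
    for idx, site in enumerate(sites):
        by_site[site].append(idx)
    if len(by_site) < n_splits:
        raise ValueError("need at least as many sites as folds")

    folds = [[] for _ in range(n_splits)]
    # Sort by size (descending), then by name, so the result is deterministic.
    ordered = sorted(by_site.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _site, members in ordered:
        min(folds, key=len).extend(members)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


# Hypothetical sample sheet: one submitting site per sample.
sites = ["Broad", "Broad", "Broad", "Baylor", "Baylor", "WashU", "WashU", "MDA"]

for train_idx, test_idx in site_preserved_folds(sites, n_splits=3):
    train_sites = {sites[i] for i in train_idx}
    test_sites = {sites[i] for i in test_idx}
    # The property that matters: no site appears on both sides of the split.
    assert train_sites.isdisjoint(test_sites)
    print(sorted(test_sites), "held out;", len(train_idx), "training samples")
```

If you already use scikit-learn, `GroupKFold` with the site passed as `groups` does the same job. Either way, a large gap between your random-split accuracy and your site-held-out accuracy is a hint that the model may be leaning on the site rather than the biology.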
Accuracy is easy. Getting the right answer for the right reasons is the hard part. Have you run into this in your own work?

Pro tip: the next time you evaluate your ML model, feed this paper (Multimodal deep learning: An improvement in prognostication or a reflection of batch effect?) into Claude Code and ask it to examine your model accordingly.

PS: If you want to learn bioinformatics, there are four ways that I can help:
Stay awesome!

PPS: Get your privacy back.