The question to ask before "integrating" omics

Published about 2 months ago • 2 min read

Hello Bioinformatics lovers,

Tommy here. Today's newsletter is a little late as I was taking care of the kids in the morning.

Multi-omics is hot and everyone is talking about it. Let's take an honest look at it.

Multi-omics integration sounds powerful. RNA-seq, methylation, proteomics — stack them together and unlock deeper biology, right?

Not quite. Before you run MOFA2 or DIABLO, there’s one question that decides everything. And most people skip it.

The question: shared or unique?

Do you want to find programs shared across omics layers, or signals unique to each modality?

That single choice dictates your method.

Unsupervised, shared variation across modalities → MOFA2. It’s essentially PCA generalized to multi-omics: it returns latent factors and reports how much variance each factor explains in each layer, so you can tell shared factors from modality-specific ones.
Supervised, predicting a phenotype or outcome → DIABLO (from the mixOmics package). It builds a classifier from correlated features across modalities.
Graph-based methods → worth trying, but only if they outperform the simpler baselines on your data.

For a real study, you can use both — MOFA2 for discovery, DIABLO for prediction. They answer different questions.

Why you can’t just merge matrices

Here’s what makes multi-omics hard: your matrix is almost never complete.

RNA-seq on 200 samples.

Proteomics on 150.

Methylation on 180.
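
To see what that incompleteness costs, here is a minimal Python sketch. Only the 200/150/180 counts come from the example above; the sample IDs and the overlap pattern are made up, so treat the numbers as illustrative.

```python
# Hypothetical sample IDs. The counts match the example above,
# but which samples overlap is invented for illustration.
rna = {f"S{i:03d}" for i in range(1, 201)}        # 200 samples: S001..S200
protein = {f"S{i:03d}" for i in range(51, 201)}   # 150 samples: S051..S200
methyl = {f"S{i:03d}" for i in range(1, 181)}     # 180 samples: S001..S180

modalities = {"rna": rna, "protein": protein, "methylation": methyl}


def overlap_report(modalities):
    """Return (samples present in every modality, samples present in any)."""
    complete = set.intersection(*modalities.values())
    any_modality = set.union(*modalities.values())
    return complete, any_modality


complete, any_modality = overlap_report(modalities)
print(f"{len(complete)} of {len(any_modality)} samples have all layers")

# Which samples is each modality missing?
for name, ids in modalities.items():
    print(name, "is missing", len(any_modality - ids), "samples")
```

With this made-up overlap, only 130 of the 200 samples have all three layers. Any method that needs complete cases works with that smaller set, so it is worth counting before you pick a tool.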

Concatenate them naively and two things happen:

High-dimensional modalities drown low-dimensional ones.
Batch effects create phantom clusters that look biological but aren’t.

Each modality also has its own statistical personality:

scATAC-seq is extremely sparse — most entries in the cell-by-peak matrix are zero.
Proteomics is noisy with substantial missingness.
RNA-seq has 20,000+ features.
Methylation platforms range from ~850K CpGs (EPIC array) to ~28 million (WGBS).

Good methods handle this through per-modality normalization, learned weights, or smart regularization. MOFA2, DIABLO, and weighted PCA all do some version of this.
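
Here is a rough sketch of the simplest version of that idea: z-score every feature, then divide each block by the square root of its feature count so every modality contributes the same total variance. This is a generic block-scaling trick, not what MOFA2 or DIABLO does internally, and the data are simulated noise with hypothetical sizes (2,000 RNA features vs 50 proteins).

```python
import math
import random

random.seed(0)
N_SAMPLES = 20


def simulate_block(n_features):
    """Hypothetical modality: N_SAMPLES x n_features of Gaussian noise."""
    return [[random.gauss(0, 1) for _ in range(n_features)]
            for _ in range(N_SAMPLES)]


def column_variance(col):
    """Population variance of one feature."""
    mu = sum(col) / len(col)
    return sum((x - mu) ** 2 for x in col) / len(col)


def standardize(block):
    """Z-score each feature (column) to mean 0 and variance 1."""
    scaled_cols = []
    for col in zip(*block):
        mu = sum(col) / len(col)
        sd = math.sqrt(column_variance(col))
        scaled_cols.append([(x - mu) / sd for x in col])
    return [list(row) for row in zip(*scaled_cols)]


def block_scale(z_block):
    """Divide a standardized block by sqrt(n_features): total variance becomes 1."""
    w = math.sqrt(len(z_block[0]))
    return [[x / w for x in row] for row in z_block]


def total_variance(block):
    """Sum of per-feature variances, i.e. the block's weight in a PCA."""
    return sum(column_variance(col) for col in zip(*block))


rna_z = standardize(simulate_block(2000))     # many features
protein_z = standardize(simulate_block(50))   # few features

naive_rna, naive_prot = total_variance(rna_z), total_variance(protein_z)
naive_share = naive_rna / (naive_rna + naive_prot)

scaled_rna = total_variance(block_scale(rna_z))
scaled_prot = total_variance(block_scale(protein_z))
scaled_share = scaled_rna / (scaled_rna + scaled_prot)

print(f"RNA share of variance, z-score only:  {naive_share:.2f}")
print(f"RNA share of variance, block-scaled: {scaled_share:.2f}")
```

After z-scoring alone, the RNA block holds about 98% of the total variance simply because it has more columns. After block scaling, each modality holds half, so a downstream PCA on the concatenated matrix no longer sees proteomics as a rounding error.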

Want to see how it fails?

Check my post: https://divingintogeneticsandgenomics.com/post/python-visium/

Spatial + gene expression integration went sideways without normalization.

Biology beats black boxes

These methods find correlations, not causes.

If the output doesn’t map back to a gene, CpG, or protein you can reason about, you don’t have a result — you have a vector.

Before you trust anything:

Check whether loadings recover known pathways.
Validate with an orthogonal experiment.
Test generalization across cohorts.

Math is nice. Biology decides whether you got it right.
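
One way to run the first check, sketched in Python: take the top-loading features of a factor and test whether a known pathway is over-represented among them with a hypergeometric test. The gene names, loadings, and pathway membership below are all invented; in practice the loadings would come from your fitted model and the gene set from a curated database.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(overlap >= k) when n features are drawn from N, K of them in the pathway."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def top_features(loadings, n_top):
    """Names of the n_top features with the largest absolute loading."""
    ranked = sorted(loadings, key=lambda g: abs(loadings[g]), reverse=True)
    return set(ranked[:n_top])


# Hypothetical loadings for one latent factor over 100 genes.
loadings = {f"G{i}": 0.01 * (i % 7) for i in range(1, 101)}   # small background
for i in range(1, 9):
    loadings[f"G{i}"] = (-1) ** i * (0.5 + 0.05 * i)           # strong, mixed sign
loadings["G50"] = 0.70
loadings["G60"] = -0.65

# Hypothetical "known pathway" of 10 genes.
pathway = {f"G{i}" for i in range(1, 11)}

top = top_features(loadings, n_top=10)
overlap = len(top & pathway)
p_value = hypergeom_sf(overlap, N=len(loadings), K=len(pathway), n=len(top))

print(f"{overlap} of {len(top)} top features are in the pathway, p = {p_value:.2e}")
```

Ranking by absolute loading matters because a factor's sign is arbitrary: strongly negative features are as informative as strongly positive ones. A small p-value here only says the factor lines up with known biology; it is still a correlation, so the orthogonal experiment and the second cohort stay on the list.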

Takeaways

Start with the question, not the tool.
Pick your method based on supervised vs unsupervised, shared vs unique.
Normalize each modality before integrating.
Validate on known biology before believing the output.

Multi-omics is messy. It’s worth it — if you know what you’re doing.

Resources:

Tools list: https://github.com/mikelove/awesome-multi-omics

Tool review: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7003173/

Overview: https://frontlinegenomics.com/a-guide-to-multi-omics-integration-strategies/

Happy Learning,

Tommy aka crazyhottommy

PS:

If you want to learn Bioinformatics, there are four ways that I can help:

My free YouTube channel, Chatomics. Make sure you subscribe to it.
I have collected many resources on my GitHub.
I have been writing blog posts for over 10 years at https://divingintogeneticsandgenomics.com/
Lastly, I post about bioinformatics and AI on LinkedIn every day: https://www.linkedin.com/in/%F0%9F%8E%AF-ming-tommy-tang-40650014/recent-activity/all/

Stay awesome!
