Matrix Factorization: The Secret Language of Gene Expression

Published 3 months ago • 2 min read

Hello Bioinformatics lovers,

Tommy here. Summer is finally here in Boston!

I look forward to more outdoor activities with the kids.

Okay, let's get into today's topic: Matrix Factorization

Matrix algebra used to feel abstract.

Something from a dusty textbook. All symbols, no meaning.

Then I applied it to biology.

And suddenly, it was everywhere—in my RNA-seq pipelines, my proteomics scripts, even single-cell analysis.

Let me show you what changed.

The Matrix Is the Foundation of -Omics

Biological data lives in matrices:

Genes × Samples (bulk RNA-seq)
Genes × Cells (scRNA-seq)
Proteins × Samples (proteomics)

But matrices don’t just store data. We analyze, reduce, factor, and project them.

Enter: Matrix Factorization.

What Is Matrix Factorization?

It’s the art of breaking a large matrix into the product of smaller ones.

A simple example: Non-negative Matrix Factorization (NMF)
We approximate the original matrix X as:
X ≈ W × H

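X: your input (e.g., a genes × samples expression matrix)
W: genes × latent factors (how strongly each gene contributes to each factor)
H: latent factors × samples (how active each factor is in each sample)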

You’re not just compressing data. You’re learning the hidden biology that drives it.
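To make X ≈ W × H concrete, here is a minimal base R sketch of NMF using the classic multiplicative update rules for squared error. The helper name nmf_sketch, the simulated matrix, and the iteration count are my assumptions for illustration only; on real data you would reach for a maintained package instead.

```r
# Toy NMF via multiplicative updates (squared-error objective).
# nmf_sketch is a hypothetical helper for illustration, not a package function.
nmf_sketch <- function(X, k, n_iter = 500, eps = 1e-9) {
  stopifnot(all(X >= 0))
  W <- matrix(runif(nrow(X) * k), nrow = nrow(X))  # genes x factors
  H <- matrix(runif(k * ncol(X)), nrow = k)        # factors x samples
  for (i in seq_len(n_iter)) {
    # Each update multiplies by a non-negative ratio, so W and H stay non-negative
    H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)
    W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}

set.seed(42)
# Simulated non-negative data: 50 genes x 12 samples built from 2 hidden programs
W_true <- matrix(runif(50 * 2), nrow = 50)
H_true <- matrix(runif(2 * 12), nrow = 2)
X <- W_true %*% H_true

fit <- nmf_sketch(X, k = 2)
dim(fit$W)  # 50 x 2
dim(fit$H)  # 2 x 12

# Relative reconstruction error: how much of X is NOT explained by W %*% H
rel_error <- norm(X - fit$W %*% fit$H, "F") / norm(X, "F")
rel_error
```

The recovered W and H will not match W_true and H_true column for column, because factor order and scale are not unique. What matters is that their product reconstructs X.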

Need more background on latent space? I wrote a post here.

Let’s Make It Real

Imagine this:

You have an RNA-seq matrix: 20,000 genes × 500 samples.

You factor it with NMF into 5 latent components.
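Those components? Often they line up with hidden transcriptional programs: pathways, subtypes, or cell states driving expression. Always check them against known biology, because a factor can also capture a batch effect.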
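A quick back-of-the-envelope check of what that factorization does to the size of the problem. The numbers are the ones from the example above; the variable names are mine.

```r
n_genes   <- 20000
n_samples <- 500
k         <- 5

entries_X <- n_genes * n_samples  # 10,000,000 values in X
entries_W <- n_genes * k          # 100,000 values in W (genes x factors)
entries_H <- k * n_samples        # 2,500 values in H (factors x samples)

# How many times fewer numbers W and H hold compared with X
compression <- entries_X / (entries_W + entries_H)
round(compression, 1)  # about 97.6
```

So each sample goes from 20,000 measurements to 5 factor weights, and each gene gets 5 weights describing which programs it belongs to.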

Matrix Factorization in the Wild

It’s not just NMF.

Matrix factorization powers tools you already use:

PCA (Principal Component Analysis)
ICA (Independent Component Analysis)
Topic Modeling (think: LDA)

[Figure: Matrix Factorization for Biological Data]

These are all different ways to crack open the matrix and ask:
What’s underneath?
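PCA fits the same "matrix = product of smaller matrices" picture. Here is a minimal sketch with base R's prcomp() on simulated data (the toy matrix is my assumption). The centered data equals the scores times the transposed rotation matrix, and keeping only the first k components gives a low-rank approximation.

```r
set.seed(1)
# Toy data: 6 samples (rows) x 4 genes (columns); prcomp() expects samples in rows
X <- matrix(rnorm(6 * 4), nrow = 6)

pca <- prcomp(X, center = TRUE, scale. = FALSE)
X_centered <- sweep(X, 2, colMeans(X))  # subtract each gene's mean

# Full factorization: centered X = scores %*% t(loadings)
recon_full <- pca$x %*% t(pca$rotation)
max(abs(recon_full - X_centered))  # numerically zero

# Rank-2 approximation: keep only the first two components
k <- 2
recon_k <- pca$x[, 1:k] %*% t(pca$rotation[, 1:k])
err_k <- norm(X_centered - recon_k, "F")  # what the dropped components carried
err_k
```

Unlike NMF, the PCA factors can contain negative values and are constrained to be orthogonal, which is one reason the two methods can tell different stories about the same data.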

Why NMF Works So Well for Bio

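NMF is especially useful in omics because raw counts and normalized expression values are non-negative. You can’t have “negative expression.” The constraint also means factors can only add up, never cancel each other out, which makes them easier to read as biological programs. One caveat: NMF requires non-negative input, so run it before any centering or z-scoring.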

Here’s a minimal R example:

```r
library(NMF)

# expr_matrix: non-negative genes × samples matrix
res <- nmf(expr_matrix, rank = 5)
W <- basis(res)  # genes × 5 factors
H <- coef(res)   # 5 factors × samples
```

I dive deep into using NMF on single-cell data in this post: https://divingintogeneticsandgenomics.com/post/matrix-factorization-for-single-cell-rnaseq-data/

How It’s Been Used

In breast cancer research, NMF helped define molecular subtypes from gene expression profiles.

That insight changed how we stratify patients.

Matrix factorization has also been used to:

Denoise noisy scRNA-seq data
Reduce dimensionality for visualization
Cluster samples or cells
Identify gene programs (aka "metagenes")
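The last two uses are easy to sketch once you have W and H. In this sketch both matrices are simulated so it runs on its own; the gene, sample, and factor names are made up. Assigning each sample to its strongest factor is just one simple clustering rule, not the only option.

```r
set.seed(7)
# Assumed inputs: W (genes x factors) and H (factors x samples) from any NMF fit.
# They are simulated here purely for illustration.
genes   <- paste0("gene", 1:30)
samples <- paste0("sample", 1:8)
factors <- paste0("factor", 1:3)
W <- matrix(runif(30 * 3), nrow = 30, dimnames = list(genes, factors))
H <- matrix(runif(3 * 8),  nrow = 3,  dimnames = list(factors, samples))

# Gene programs ("metagenes"): the 5 highest-weighted genes in each column of W
top_genes <- apply(W, 2, function(w) names(sort(w, decreasing = TRUE))[1:5])
top_genes

# Simple clustering: assign each sample to the factor with the largest weight in H
sample_cluster <- rownames(H)[apply(H, 2, which.max)]
names(sample_cluster) <- colnames(H)
table(sample_cluster)
```

The top genes per factor are what you would feed into an enrichment analysis to put a biological name on each program.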

One Cell, Many Signals

In single-cell RNA-seq, matrix factorization helps decode what each cell is made of.

Instead of 20,000 numbers per cell, you get a handful of factor weights: how active each latent process is in that cell. You can then cluster cells on those weights.
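Here is a minimal sketch of that clustering step with base R's kmeans(). I simulate H (factors × cells) with two obvious groups of cells so the example is self-contained; with real data, H would come from your factorization and the number of clusters would be a judgment call.

```r
set.seed(3)
# Assumed input: H (factors x cells). Simulated here: 3 factors, 60 cells.
H <- cbind(
  matrix(c(5, 1, 1) + runif(3 * 30), nrow = 3),  # cells 1-30: high on factor 1
  matrix(c(1, 5, 1) + runif(3 * 30), nrow = 3)   # cells 31-60: high on factor 2
)
true_group <- rep(c("group1", "group2"), each = 30)

# kmeans() clusters rows, so transpose to cells x factors
km <- kmeans(t(H), centers = 2, nstart = 10)

# Cross-tabulate the clusters against the simulated groups
table(cluster = km$cluster, truth = true_group)
```

Clustering on 3 factor weights instead of thousands of genes is faster and usually less noisy, which is the same reason single-cell workflows cluster in a reduced space.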

That’s powerful.

Further Reading

Want to understand it deeper?

Read:
“Matrix Factorization Techniques for Omics” https://www.sciencedirect.com/science/article/pii/S0168952518301240

Key Takeaways

Biology speaks to computation through matrices
Matrix factorization helps you extract meaning, not just numbers
PCA, NMF, and ICA are your tools to decode biological signals

What You Should Do Next

Run prcomp() or nmf() on your data
Ask: what do these components represent biologically?
Use heatmaps to visualize gene × factor contributions
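A minimal sketch of those three steps with base R only. The expression matrix is simulated and its names are hypothetical; swap in your own genes × samples matrix. One thing to watch: prcomp() treats rows as observations, so a genes × samples matrix needs to be transposed first.

```r
set.seed(11)
# Hypothetical genes x samples matrix of log-scale expression values
expr <- matrix(rnorm(100 * 10), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:10)))

# prcomp() expects samples in rows, so transpose to samples x genes
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

gene_loadings <- pca$rotation[, 1:3]  # genes x 3 components
sample_scores <- pca$x[, 1:3]         # samples x 3 components

# Heatmap of the 20 genes with the largest absolute loading on PC1
top <- order(abs(gene_loadings[, "PC1"]), decreasing = TRUE)[1:20]
heatmap(gene_loadings[top, ], Colv = NA, scale = "none",
        main = "Gene x component loadings")
```

From here, look up the top genes for each component and ask whether they share a pathway, a cell type, or a technical artifact such as batch.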

Final Thought

Stop seeing gene expression as a pile of numbers.

Start seeing it as a mixture of hidden processes—subtypes, pathways, cell states.

Matrix factorization helps you decode that language.

That’s why it’s not just math.
It is discovery.

Chatomics! — The Bioinformatics Newsletter

