
Hi! I'm Tommy Tang

Matrix Factorization: The Secret Language of Gene Expression


Hello Bioinformatics lovers,

Tommy here. Summer is finally here in Boston!

I look forward to more outdoor activities with the kids.

Okay, let's get into today's topic: Matrix Factorization.

Matrix algebra used to feel abstract.

Something from a dusty textbook. All symbols, no meaning.

Then I applied it to biology.

And suddenly, it was everywhere—in my RNA-seq pipelines, my proteomics scripts, even single-cell analysis.

Let me show you what changed.


The Matrix Is the Foundation of -Omics

Biological data lives in matrices:

  • Genes × Samples (bulk RNA-seq)
  • Genes × Cells (scRNA-seq)
  • Proteins × Samples (proteomics)

But matrices don’t just store data. We analyze, reduce, factor, and project them.

Enter: Matrix Factorization.


What Is Matrix Factorization?

It’s the art of breaking a large matrix into the product of smaller ones.

A simple example: Non-negative Matrix Factorization (NMF)
We approximate the original matrix X as:
X ≈ W × H

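  • X: your input (e.g., gene expression, genes × samples)
  • W: genes × latent factors (how much each gene contributes to each factor)
  • H: latent factors × samples (how active each factor is in each sample)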

You’re not just compressing data. You’re learning the hidden biology that drives it.
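To make X ≈ W × H concrete, here is a minimal base-R sketch. It is not how any particular package implements NMF; it assumes the classic Lee and Seung multiplicative updates for squared error, and the simulated data, the sizes, and the names W_true, H_true, and eps are made up for illustration.

```r
# Toy NMF via multiplicative updates (squared-error loss), base R only.
set.seed(1)
n_genes <- 100; n_samples <- 20; k <- 3

# Simulate a non-negative matrix with rank-k structure plus a little noise.
W_true <- matrix(rexp(n_genes * k), n_genes, k)
H_true <- matrix(rexp(k * n_samples), k, n_samples)
X <- W_true %*% H_true +
  matrix(runif(n_genes * n_samples, 0, 0.1), n_genes, n_samples)

# Random non-negative starting values.
W <- matrix(runif(n_genes * k), n_genes, k)
H <- matrix(runif(k * n_samples), k, n_samples)
eps <- 1e-9  # guards against division by zero

err_start <- norm(X - W %*% H, "F")
for (i in 1:500) {
  # Each update rescales entries by a non-negative ratio, so W and H stay >= 0.
  H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)
  W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)
}
err_end <- norm(X - W %*% H, "F")

dim(W)  # genes x factors
dim(H)  # factors x samples
c(start = err_start, end = err_end)  # reconstruction error before and after
```

On real data you would reach for a tested package rather than this loop, but the shapes are the point: W is genes × factors, H is factors × samples, and their product approximates X.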

Need more background on latent space? I wrote a post here.


Let’s Make It Real

Imagine this:

You have an RNA-seq matrix: 20,000 genes × 500 samples.

You factor it with NMF into 5 latent components.
Those components? Candidate hidden transcriptional programs: pathways, subtypes, or cell states driving expression. They can also pick up batch effects, so always check what each one tracks.
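A quick back-of-the-envelope calculation shows how much smaller the factored form is. This only counts stored numbers; it says nothing about how well 5 components would fit your data.

```r
n_genes <- 20000; n_samples <- 500; k <- 5

n_original <- n_genes * n_samples          # values in X: 10,000,000
n_factored <- n_genes * k + k * n_samples  # values in W plus H: 102,500

n_original / n_factored                    # roughly 97.6-fold fewer numbers
```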


Matrix Factorization in the Wild

It’s not just NMF.

Matrix factorization powers tools you already use:

  • PCA (Principal Component Analysis)
  • ICA (Independent Component Analysis)
  • Topic Modeling (think: LDA)

These are all different ways to crack open the matrix and ask:
What’s underneath?
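PCA makes the "factorization" part easy to see, because it is the singular value decomposition of the centered data. The sketch below uses random toy data (the sizes are arbitrary) and checks that prcomp() and a by-hand svd() give the same sample scores, up to sign.

```r
set.seed(42)
# Toy data: 30 samples (rows) x 200 genes (columns).
# prcomp() expects samples in rows.
X <- matrix(rnorm(30 * 200), nrow = 30, ncol = 200)

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# The same factorization by hand: center each gene, then take the SVD.
Xc <- scale(X, center = TRUE, scale = FALSE)
s  <- svd(Xc)                        # Xc = U D V^T

scores_svd <- s$u %*% diag(s$d)      # samples x components (U D)
recon      <- scores_svd %*% t(s$v)  # U D V^T rebuilds the centered matrix

# Both differences should be numerically zero; component signs may flip.
max(abs(abs(pca$x) - abs(scores_svd)))
max(abs(Xc - recon))
```

Keeping only the first few columns of U D and V gives the low-rank approximation that PCA plots are built on.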


Why NMF Works So Well for Bio

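NMF is especially useful in omics because expression values such as raw counts, TPM, or log1p-transformed counts are non-negative. You can’t have “negative expression.” Non-negativity also means factors can only add up, never cancel each other out, which makes them easier to read as parts of a whole. One caveat: centered or z-scored data contains negative values, so give NMF the matrix before that kind of scaling.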

Here’s a minimal R example:

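library(NMF)
# expr_matrix: non-negative genes × samples matrix, with all-zero rows removed
res <- nmf(expr_matrix, rank = 5, seed = 123456) # fixed seed for reproducibility
W <- basis(res) # genes × 5 factors
H <- coef(res) # 5 factors × samples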

I dive deep into using NMF on single-cell data in this post: https://divingintogeneticsandgenomics.com/post/matrix-factorization-for-single-cell-rnaseq-data/


How It’s Been Used

In breast cancer research, NMF helped define molecular subtypes from gene expression profiles.

That insight changed how we stratify patients.

Matrix factorization has also been used to:

  • Denoise noisy scRNA-seq data
  • Reduce dimensionality for visualization
  • Cluster samples or cells
  • Identify gene programs (aka "metagenes")

One Cell, Many Signals

In single-cell RNA-seq, matrix factorization helps decode what each cell is made of.

Instead of 20,000 numbers per cell, you get a handful of factor scores: weights on latent processes. You cluster cells by those.

That’s powerful.
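Here is a rough sketch of what "cluster cells by those" can look like. The H matrix below is simulated as a stand-in for the factors × cells output of a factorization (for example coef(res)); the three groups, their sizes, and the name truth are assumptions made up for this toy example.

```r
set.seed(7)
k <- 3; cells_per_group <- 30
truth <- rep(1:k, each = cells_per_group)  # hypothetical true cell groups

# Hypothetical factors x cells matrix: low background usage everywhere,
# high usage of one program per cell.
H <- matrix(runif(k * length(truth), 0, 0.2), nrow = k)
H[cbind(truth, seq_along(truth))] <- runif(length(truth), 0.8, 1.2)
rownames(H) <- paste0("factor", 1:k)

# Option 1: label each cell by its dominant factor.
dominant <- unname(apply(H, 2, which.max))

# Option 2: k-means on the cells x factors matrix.
km <- kmeans(t(H), centers = k, nstart = 10)

table(dominant, truth)
table(km$cluster, truth)
```

Real cells are rarely this clean. A cell can use several programs at once, so the dominant-factor label is a simplification worth checking against the full score profile.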


Further Reading

Want to understand it deeper?

Read:
“Enter the Matrix: Factorization Uncovers Knowledge from Omics” https://www.sciencedirect.com/science/article/pii/S0168952518301240


Key Takeaways

  • Biology speaks to computation through matrices
  • Matrix factorization helps you extract meaning, not just numbers
  • PCA, NMF, ICA are your tools to decode biological signals

What You Should Do Next

  • Run prcomp() or nmf() on your data (prcomp() expects samples in rows, so transpose a genes × samples matrix with t() first)
  • Ask: what do these components represent biologically?
  • Use heatmaps to visualize gene × factor contributions
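One possible way to do the last two steps in base R is sketched below. The W matrix is simulated as a stand-in for a genes × factors matrix such as basis(res); the gene names and the planted marker genes are invented so the example has something to find.

```r
set.seed(3)
n_genes <- 60; k <- 3

# Hypothetical genes x factors matrix standing in for basis(res).
W <- matrix(rexp(n_genes * k, rate = 5), n_genes, k,
            dimnames = list(paste0("gene", 1:n_genes),
                            paste0("factor", 1:k)))

# Plant 5 marker genes per factor so there is structure to see.
for (j in 1:k) {
  idx <- ((j - 1) * 5 + 1):(j * 5)
  W[idx, j] <- W[idx, j] + 5
}

# Top 5 genes per factor, ranked by weight.
top_genes <- apply(W, 2, function(w) {
  rownames(W)[order(w, decreasing = TRUE)[1:5]]
})
top_genes

# Heatmap of the selected genes x factors; rows are scaled so each
# gene's preferred factor stands out.
sel <- unique(as.vector(top_genes))
heatmap(W[sel, ], Rowv = NA, Colv = NA, scale = "row", margins = c(6, 6))
```

With real results, the top genes per factor are what you would take to enrichment analysis or compare against known marker lists to answer the "what does this component represent" question.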

Final Thought

Stop seeing gene expression as a pile of numbers.

Start seeing it as a mixture of hidden processes—subtypes, pathways, cell states.

Matrix factorization helps you decode that language.

That’s why it’s not just math.
It is discovery.


Other posts from the past week that you may find useful

  1. Everyone wants to master bioinformatics fast. But here's the cold truth: Speed is a lie. Time is your ally.
  2. 7 FREE Books to learn data science 🧵
  3. Too many bioinformatics analyses crash silently. Have your positive control!
  4. How to calculate gene-gene correlation in scRNAseq.
  5. Unix trick with paste - -
  6. Bioinformaticians: Before you heroically code your own method...STOP. You might be about to reinvent a very broken wheel.
  7. Which programming language should I use? Python or R?
  8. DESeq2-MultiBatch: Batch Correction for Multi-Factorial RNA-seq Experiments
  9. What does the dot product have to do with bioinformatics?
  10. I used to be a wet lab biologist. Now I’m a bioinformatician. It took me 10 years.

Happy Learning!

Tommy aka crazyhottommy

PS:

If you want to learn Bioinformatics, there are other ways that I can help:

  1. My free YouTube channel, Chatomics. Make sure you subscribe to it.
  2. I have many resources collected on my GitHub here.
  3. I have been writing blog posts for over 10 years at https://divingintogeneticsandgenomics.com/

Stay awesome!

I am a bioinformatician/computational biologist with six years of wet lab experience and over 12 years of computation experience. I will help you to learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter!
