Your UMAP plot might be lying to you

Published about 1 month ago • 1 min read

Hello Bioinformatics lovers,

Tommy here. Can you believe it is June now? Boston summer is finally here and I hope you are learning new things.

Let's talk about UMAP today.

Tilt a photo of a hand and the same gesture reads as a wave or a threat. Same data, different projection, different meaning. That is the risk every time you flatten single-cell data into two dimensions.

A single-cell experiment gives you expression for thousands of genes per cell. You cannot see thousands of dimensions.

So you reduce them: PCA, t-SNE, UMAP. Each method makes a different trade, and each tells a different story about the same cells.

PCA is linear. It ranks new axes by variance, and distances between points carry real meaning: how much two cells differ overall.

That makes it interpretable. It also misses nonlinear structure, where a lot of biology lives.
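To make "ranks new axes by variance" concrete, here is a minimal sketch of PCA using NumPy's SVD. The cells-by-genes matrix is synthetic, and the function name pca and the two-group shift are my own assumptions for illustration, not a real dataset or a library API.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center each gene (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)              # fraction of variance per axis
    scores = Xc @ Vt[:n_components].T            # cell coordinates on the new axes
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic stand-in for a cells x genes matrix: 200 cells, 50 genes.
X = rng.normal(size=(200, 50))
X[:100, :5] += 4.0   # shift the first 100 cells on 5 genes to create two groups

scores, explained = pca(X)
print("variance explained by PC1, PC2:", explained)
print("PC1 gap between the two groups:",
      abs(scores[:100, 0].mean() - scores[100:, 0].mean()))
```

Because the projection is linear, the gap between the two groups along PC1 is a real distance in the centered expression space, which is what makes PCA distances something you can reason about.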

t-SNE and UMAP are nonlinear. Both model each cell's nearest neighbors in high-dimensional space (UMAP as an explicit neighbor graph, t-SNE as neighbor similarities) and place cells so that close neighbors stay close on the plot.

They preserve local structure. They do not preserve distance between distant clusters.
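One way to check "preserves local structure" on your own data is to ask how many of each cell's nearest neighbors survive in the 2D layout. The sketch below is a rough illustration with NumPy only. The helper names knn_indices and knn_overlap are hypothetical, the data are synthetic, and in practice X_low would be the coordinates from your own t-SNE or UMAP run.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row (Euclidean), excluding itself."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                   # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def knn_overlap(X_high, X_low, k=15):
    """Mean fraction of each cell's k high-dimensional neighbors kept in the layout."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
X_high = rng.normal(size=(200, 30))          # synthetic stand-in for cells x PCs

# Sanity checks: an identical copy keeps every neighbor,
# a random 2D scatter keeps almost none.
overlap_same = knn_overlap(X_high, X_high.copy())
overlap_random = knn_overlap(X_high, rng.normal(size=(200, 2)))
print(overlap_same, overlap_random)
```

A high overlap says the layout kept local neighborhoods. It says nothing about whether the distances between far-apart clusters mean anything.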

Here is where people fool themselves. On a UMAP, two clusters sit far apart and you read a deep biological difference into the gap.

The gap is not a measurement.

Neither method preserves those large distances, so the space between clusters tells you little. Cluster size misleads the same way: t-SNE inflates sparse clusters and compresses dense ones, so a big blob is not necessarily a more heterogeneous population.
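If you want to see how much a layout distorts the gaps, compare distances between cluster centroids in the original space with the same distances on the plot. The sketch below is a toy under loud assumptions: the high-dimensional data are synthetic, and X_low is a hand-made 2D layout that spaces the clusters evenly. It stands in for a nonlinear embedding and is not the output of UMAP or t-SNE. The helper centroid_distances is a hypothetical name.

```python
import numpy as np

def centroid_distances(X, labels):
    """Pairwise Euclidean distances between cluster centroids."""
    groups = np.unique(labels)
    centroids = np.array([X[labels == g].mean(axis=0) for g in groups])
    diff = centroids[:, None, :] - centroids[None, :, :]
    return groups, np.sqrt((diff**2).sum(axis=-1))

rng = np.random.default_rng(1)
labels = np.repeat(np.array(["A", "B", "C"]), 50)

# Synthetic high-dimensional data: B sits near A, C sits far from both.
X_high = rng.normal(size=(150, 20))
X_high[50:100, 0] += 3.0
X_high[100:, 0] += 30.0

# Hand-made 2D layout (NOT a real embedding): three evenly spaced blobs.
centers = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (5.0, 8.66)}
X_low = np.array([centers[l] for l in labels]) + rng.normal(scale=0.5, size=(150, 2))

groups, d_high = centroid_distances(X_high, labels)
_, d_low = centroid_distances(X_low, labels)

# How much farther is C from A than B is from A?
ratio_high = d_high[0, 2] / d_high[0, 1]
ratio_low = d_low[0, 2] / d_low[0, 1]
print("A-C vs A-B in expression space:", ratio_high)
print("A-C vs A-B on the 2D layout:  ", ratio_low)
```

When the two ratios disagree this much, the gap on the plot is telling you about the layout, not about the cells.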

So use each method for what it does well:

Use PCA when you need interpretable distances and variance you can reason about.
Use t-SNE or UMAP to see local structure: rare populations, developmental branches, cluster separation.
Confirm patterns with marker genes and expression, not with where the dots landed.
Treat gaps, blob sizes, and positions as suggestions, never as conclusions.
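Confirming with marker genes can be as simple as comparing a gene's expression inside a cluster against every other cell. Here is a minimal sketch, assuming you already have a per-cell expression vector for one gene and a cluster label per cell. The function marker_summary, the cluster names, and the Poisson counts are all hypothetical and synthetic.

```python
import numpy as np

def marker_summary(expr, labels, cluster):
    """Compare one gene's expression inside a cluster to all other cells."""
    inside = expr[labels == cluster]
    outside = expr[labels != cluster]
    return {
        "mean_in": float(inside.mean()),
        "mean_out": float(outside.mean()),
        "frac_expressing_in": float((inside > 0).mean()),
        "frac_expressing_out": float((outside > 0).mean()),
    }

rng = np.random.default_rng(2)
labels = np.repeat(np.array(["cluster_1", "cluster_2"]), 100)

# Synthetic counts for one gene: high in cluster_1, near zero elsewhere.
expr = np.concatenate([rng.poisson(5.0, size=100), rng.poisson(0.2, size=100)])

summary = marker_summary(expr, labels, "cluster_1")
print(summary)
```

If a cluster that looks distinct on the plot shows no gene with this kind of difference, treat the separation with suspicion.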

Dimensionality reduction compresses the data. It does not do your thinking.

That striking UMAP is the start of the analysis, not the result.

What is the worst over-read UMAP you have seen in a paper or a talk? Reply and tell me. I collect these.

Happy Learning!

Tommy aka crazyhottommy

PS:

If you want to learn Bioinformatics, there are four ways that I can help:

My free YouTube Chatomics channel. Make sure you subscribe to it.
I have many resources collected on my GitHub here.
I have been writing blog posts for over 10 years: https://divingintogeneticsandgenomics.com/
Lastly, I post daily on LinkedIn and you may find useful posts there: https://www.linkedin.com/in/%F0%9F%8E%AF-ming-tommy-tang-40650014/recent-activity/all/

Stay awesome!

Chatomics! — The Bioinformatics Newsletter