
Chatomics! — The Bioinformatics Newsletter

Why KNN Isn’t as “Simple” as Everyone Says (Especially in Single-Cell RNA-seq)


Hello Bioinformatics lovers,

Tommy here. KNN looks simple. But in single-cell RNA-seq, it’s an art disguised as an algorithm.

Everyone talks about k-nearest neighbors (KNN) like it’s the easiest algorithm in machine learning.

In theory, it is: You classify a point based on the majority of its k closest neighbors.

Just one hyperparameter—k. What could go wrong?

A lot, actually.


The Hidden Complexity

Choosing k is a balancing act:

  • Small k → low bias, high variance.
  • Large k → high bias, low variance.

It’s the classic tradeoff—except in real biological data, the balance is fragile.
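You can watch the two extremes on synthetic data. In this sketch, two overlapping Gaussian blobs stand in for messy biology, and `loo_accuracy` (a hypothetical helper, not a library function) scores leave-one-out accuracy at different k:

```python
import numpy as np

rng = np.random.default_rng(1)
# Overlapping classes: some points inevitably sit on the "wrong" side
X = np.vstack([rng.normal(0, 1.5, (100, 2)), rng.normal(2, 1.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: classify each point by its k nearest *other* points."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is never its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]
    pred = np.array([np.bincount(y[row]).argmax() for row in nn])
    return (pred == y).mean()

for k in (1, 15, 199):
    print(f"k={k:3d}  LOO accuracy={loo_accuracy(X, y, k):.2f}")
```

At k=1 every noisy point gets its own say (variance); at k=199, every point is outvoted by the whole other class (bias). Somewhere in between lives the answer, and it moves with the data.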


Why It Gets Messy in Single-Cell RNA-seq

In single-cell analysis, KNN isn’t applied directly to raw gene counts.

We build a KNN graph on top of PCA-reduced space.

Now you’re not tuning one knob, but two:

1️⃣ How many principal components (PCs) to keep.

2️⃣ What k to use.
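In Scanpy those two knobs are `sc.pp.pca` and `sc.pp.neighbors`. Under the hood, the recipe is roughly the following NumPy sketch on simulated counts (not Scanpy's actual implementation, which uses approximate nearest neighbors for speed):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.poisson(1.0, (300, 2000)).astype(float)   # fake cells-by-genes counts

# Normalize and log-transform (a stand-in for standard preprocessing)
X = np.log1p(X / X.sum(axis=1, keepdims=True) * 1e4)

# Knob 1: how many PCs to keep
n_pcs = 30
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :n_pcs] * S[:n_pcs]                    # cells embedded in PC space

# Knob 2: k for the neighbor graph
k = 20
d = np.linalg.norm(pcs[:, None, :] - pcs[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)                       # exclude self-neighbors
nn = np.argsort(d, axis=1)[:, :k]                 # each cell's k nearest cells

# Adjacency matrix of the KNN graph: this is what clustering (e.g., Leiden) consumes
n_cells = X.shape[0]
A = np.zeros((n_cells, n_cells), dtype=bool)
A[np.arange(n_cells)[:, None], nn] = True
print(A.sum(axis=1)[:5])                          # every cell has exactly k neighbors
```

Every downstream result, clusters, UMAPs, trajectories, inherits whatever choices you made at these two steps.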


How Many PCs?

Too few → you miss structure.

Too many → you add noise.

Seurat defaults to 50 PCs — but that’s arbitrary.

For something simple like PBMC3k, 15 may suffice. For complex neuron datasets, I’ve used 100+.

There’s no universal answer. Some rely on elbow plots or jackstraw tests. None are perfect.

Sometimes, it’s just intuition backed by experience.

(I wrote more on this here)
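An elbow plot is nothing more than the variance explained by each PC. Here's a sketch on simulated data with 10 real dimensions of structure buried in noise; the "biggest drop" rule at the end is a crude stand-in for eyeballing the plot, not a method I'd defend:

```python
import numpy as np

rng = np.random.default_rng(3)
# Fake data: 10 genuine dimensions of structure plus Gaussian noise
signal = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 200))
X = signal + rng.normal(scale=0.5, size=(500, 200))

Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)    # singular values, largest first
var_ratio = S**2 / (S**2).sum()            # fraction of variance per PC

# Crude elbow: the PC with the largest drop to the next one
elbow = int(np.argmax(var_ratio[:-1] / var_ratio[1:])) + 1
print("variance explained by first 12 PCs:", np.round(var_ratio[:12], 3))
print("elbow suggests keeping ~", elbow, "PCs")
```

On clean simulated data the elbow is obvious. On a real neuron atlas, the curve flattens slowly and the "right" cutoff is genuinely ambiguous, which is why intuition keeps sneaking back in.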


Choosing k: The Real Art

In Seurat, the default is k = 20 (Scanpy's default is 15). But what if:

  • You have a rare cell type (say 50 cells total)? Then k=20 may drown it in noise.
  • You have 100k cells? Then k=5 may over-fragment your clusters.

The truth: k depends on your data size, heterogeneity, and biological questions.


So How Do You Decide?

You experiment. Try k = 10, 30, 50. Compare cluster structure, marker genes, biological consistency.

Ask:

  • Are rare populations preserved?
  • Do known markers align?
  • Does k change your conclusions?
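A quick way to probe that first question: plant a small synthetic "rare" population in a toy PC space and watch what fraction of each rare cell's k nearest neighbors are still rare as k grows. (This is a hypothetical sanity check of my own, not a Seurat or Scanpy function.)

```python
import numpy as np

rng = np.random.default_rng(4)
# A toy "PC space": 950 common cells plus a rare population of 50, shifted away
common = rng.normal(0.0, 1.0, (950, 30))
rare = rng.normal(0.0, 1.0, (50, 30)) + 2.0
X = np.vstack([common, rare])
is_rare = np.array([False] * 950 + [True] * 50)

# Squared Euclidean distances (enough for ranking neighbors)
sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
np.fill_diagonal(d2, np.inf)               # a cell is not its own neighbor
order = np.argsort(d2, axis=1)

def rare_neighbor_fraction(k):
    """Of the k nearest neighbors of each rare cell, what fraction are also rare?"""
    return is_rare[order[is_rare, :k]].mean()

for k in (10, 30, 100):
    print(f"k={k:3d}  rare cells' neighbors that are rare: {rare_neighbor_fraction(k):.2f}")
```

Once k exceeds the size of the rare population (here 50), the fraction can't stay near 1: the rare cells are forced to borrow neighbors from the majority, and clustering on that graph will tend to absorb them.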

Machine learning gives you metrics. But in biology, sanity checks matter more than F1 scores.


The Real Lesson

KNN doesn’t reveal truth—it reveals proximity. It builds the scaffold of similarity. You must decorate it with biological insight.


Key Takeaways

  • KNN is simple in theory, nuanced in practice.
  • PCA + KNN = a double dose of parameter tuning.
  • Always visualize, validate, and question defaults.
  • In bioinformatics, rigor isn’t always about formulas—it’s about judgment.

Know the math. Trust your eyes. Ask good questions.

That’s where the art lives.

Happy Learning!

Tommy aka crazyhottommy

PS:

If you want to learn Bioinformatics, there are other ways that I can help:

  1. My free YouTube channel, Chatomics. Make sure you subscribe to it.
  2. I have many resources collected on my github here.
  3. I have been writing blog posts for over 10 years at https://divingintogeneticsandgenomics.com/

Stay awesome!


Why Subscribe?

✅ Curated by Tommy Tang, a Director of Bioinformatics with 100K+ followers across LinkedIn, X, and YouTube
✅ No fluff—just deep insights and working code examples
✅ Trusted by grad students, postdocs, and biotech professionals
✅ 100% free
