Why Subscribe?
✅ Curated by Tommy Tang, a Director of Bioinformatics with 100K+ followers across LinkedIn, X, and YouTube
✅ No fluff, just deep insights and working code examples
✅ Trusted by grad students, postdocs, and biotech professionals
✅ 100% free
Hello Bioinformatics lovers,

Tommy here.

KNN looks simple. But in single-cell RNA-seq, it’s an art disguised as an algorithm.

Everyone talks about k-nearest neighbors (KNN) like it’s the easiest algorithm in machine learning. In theory, it is: you classify a point based on the majority of its k closest neighbors. Just one hyperparameter: k. What could go wrong?

A lot, actually.

The Hidden Complexity

Choosing k is a balancing act:
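Here is a minimal sketch of that balancing act in plain Python. The toy points, the labels "A" and "B", and the helper name `knn_predict` are all made up for illustration; this is not how Seurat or Scanpy use KNN, just the textbook majority vote.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Return the majority label among the k training points closest to query."""
    # train is a list of ((x, y), label) pairs; distance is plain Euclidean.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up): a small group "A" sitting next to a larger group "B".
train = [
    ((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
    ((1.0, 1.0), "B"), ((1.2, 0.9), "B"), ((0.9, 1.3), "B"),
    ((1.1, 1.1), "B"), ((1.4, 1.2), "B"), ((1.3, 0.8), "B"),
]
query = (0.3, 0.3)  # sits inside the small "A" group

small_k = knn_predict(train, query, k=3)  # only the three closest points vote
large_k = knn_predict(train, query, k=9)  # every point votes, so the bigger group wins
print(small_k, large_k)
```

Same point, same data, different k, different answer. With k = 3 the query is called "A"; with k = 9 the larger group outvotes its actual neighborhood.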
It’s the classic tradeoff, except in real biological data the balance is fragile.

Why It Gets Messy in Single-Cell RNA-seq

In single-cell analysis, KNN isn’t applied directly to raw gene counts. We build a KNN graph on top of PCA-reduced space. Now you’re tuning not one knob, but two:

1️⃣ How many principal components (PCs) to keep.
2️⃣ What k to use.

How Many PCs?

Too few → you miss structure.
Too many → you add noise.

Seurat’s RunPCA computes 50 PCs by default, but that number is arbitrary. For something simple like PBMC3k, 15 may suffice. For complex neuron datasets, I’ve used 100+. There’s no universal answer.

Some rely on elbow plots or jackstraw tests. None are perfect. Sometimes, it’s just intuition backed by experience. (I wrote more on this here)

Choosing k: The Real Art

In Seurat, the default is k = 20 (k.param in FindNeighbors); in Scanpy, n_neighbors defaults to 15. But what if:
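To make the two knobs concrete, here is a rough NumPy sketch: reduce with PCA, then find each cell's k nearest neighbors in PC space. The random matrix is only a stand-in for a normalized cells x genes matrix, and `pca_scores` and `knn_indices` are hypothetical helper names. Seurat and Scanpy have their own neighbor-search implementations, so treat this as an illustration of the idea, not of their code.

```python
import numpy as np

def pca_scores(X, n_pcs):
    """Project the rows of X onto the top n_pcs principal components."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]               # PC scores, shape (n_cells, n_pcs)

def knn_indices(Z, k):
    """For each row of Z, return the indices of its k nearest other rows (Euclidean)."""
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                    # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # made-up stand-in for a cells x genes matrix
X[:100, :5] += 4.0               # give the first 100 "cells" a shared shift

Z = pca_scores(X, n_pcs=10)      # knob 1: how many PCs to keep
nbrs = knn_indices(Z, k=20)      # knob 2: what k to use
print(Z.shape, nbrs.shape)
```

Change `n_pcs` or `k` and the neighbor lists change with them, which is why the two choices can't really be tuned in isolation.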
The truth: k depends on your data size, heterogeneity, and biological questions.

So How Do You Decide?

You experiment. Try k = 10, 30, 50. Compare cluster structure, marker genes, and biological consistency. Ask:
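One way to run that kind of experiment on toy data is sketched below. Everything here is an assumption for illustration: the embedding is simulated, the 12-cell "rare" group is invented, and `neighbor_agreement` is a hypothetical helper that measures what fraction of a cell's neighbors share its label. On real data you would not have ground-truth labels, so this is a sanity check on intuition, not a recipe.

```python
import numpy as np

def neighbor_agreement(Z, labels, k):
    """Per-cell fraction of the k nearest neighbors that share the cell's label."""
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                    # exclude self
    nbrs = np.argsort(d2, axis=1)[:, :k]
    return (labels[nbrs] == labels[:, None]).mean(axis=1)

rng = np.random.default_rng(1)
# Toy embedding (made up): 300 "common" cells and 12 "rare" cells in 10 dimensions.
common = rng.normal(loc=0.0, size=(300, 10))
rare = rng.normal(loc=3.0, size=(12, 10))
Z = np.vstack([common, rare])
labels = np.array([0] * 300 + [1] * 12)

rare_agreement = {}
for k in (10, 30, 50):
    per_cell = neighbor_agreement(Z, labels, k)
    rare_agreement[k] = per_cell[labels == 1].mean()  # average over the rare cells only
    print(k, round(float(rare_agreement[k]), 2))
```

A rare cell has at most 11 rare neighbors here, so once k exceeds that, most of its neighborhood has to come from the common population no matter how well separated the groups are.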
Machine learning gives you metrics. But in biology, sanity checks matter more than F1 scores.

The Real Lesson

KNN doesn’t reveal truth; it reveals proximity. It builds the scaffold of similarity. You must decorate it with biological insight.

Key Takeaways
Know the math. Trust your eyes. Ask good questions. That’s where the art lives.

Happy Learning!

Tommy aka crazyhottommy

PS: If you want to learn Bioinformatics, there are other ways that I can help:
Stay awesome!