Hi! I'm Tommy Tang

GSEA Demystified: How to Make Sense of Gene Set Enrichment Analysis


Hello Bioinformatics lovers,

I publish a long post on LinkedIn every day. I then pick the one with the most views and expand it into this newsletter.

It is surprising to me that the GSEA post got the most impressions, although the method was published back in 2005.

I am a data scientist, and I use data to guide my decisions :)

So, let's learn more about it!

If you've ever worked with RNA-seq data, you've probably been asked to perform Gene Set Enrichment Analysis (GSEA).

It is one of the most widely used tools for pathway analysis (> 50k citations), yet it often feels like a black box.

If you are not patient, check out my Chatomics YouTube video for a step-by-step tutorial on GSEA with RNA-seq data.

So, let’s break it down.

What is GSEA?

GSEA was developed by the Broad Institute to help interpret genome-wide expression data.

Unlike other pathway analysis tools, GSEA does not rely on arbitrary cutoffs for gene selection.

Instead, it ranks all genes in an experiment and checks if a predefined gene set is enriched at the top or bottom of the list.

Original paper: Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles


How GSEA Works

  1. Rank all genes based on how well they separate your conditions (e.g., tumor vs. normal) using a metric.
  2. Compare gene sets (from public databases) to this ranked list.
  3. Compute an Enrichment Score (ES) to see if a gene set is enriched at the top (upregulated) or bottom (downregulated).
  4. Normalize the ES (NES) and adjust for multiple testing (FDR).

A positive NES means the gene set is enriched at the top (upregulated genes).
A negative NES means the gene set is enriched at the bottom (downregulated genes).
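
To make steps 3 and 4 less of a black box, here is a minimal sketch of the running-sum enrichment score and a permutation-based NES. It is not the Broad Institute implementation: the function names, the toy genes, and the toy scores are all made up for illustration, and the null distribution here comes from random gene sets of the same size (the gene-set permutation idea), with the NES taken as the observed ES divided by the mean of the null ES values that share its sign.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Walk down the ranked list; return the running sum's largest deviation from zero."""
    in_set = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_total = sum(abs(s) ** weight for s, hit in zip(scores, in_set) if hit)
    if hit_total == 0 or n_miss == 0:
        return 0.0  # no hits or no misses: nothing to score
    running, es = 0.0, 0.0
    for s, hit in zip(scores, in_set):
        if hit:
            running += abs(s) ** weight / hit_total  # step up at a gene in the set
        else:
            running -= 1.0 / n_miss  # step down at a gene outside the set
        if abs(running) > abs(es):
            es = running
    return es


def normalized_es(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Return (ES, NES, empirical p-value) using random gene sets as the null."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    size = sum(gene in gene_set for gene in ranked_genes)
    null = [
        enrichment_score(ranked_genes, scores, set(rng.sample(ranked_genes, size)))
        for _ in range(n_perm)
    ]
    # Normalize against null scores with the same sign as the observed ES.
    same_sign = [es for es in null if es * observed > 0]
    if not same_sign:
        return observed, float("nan"), float("nan")
    nes = observed / (sum(abs(es) for es in same_sign) / len(same_sign))
    p_value = (1 + sum(abs(es) >= abs(observed) for es in same_sign)) / (1 + len(same_sign))
    return observed, nes, p_value


# Toy ranked list: 20 hypothetical genes, already sorted from high to low score.
genes = [f"g{i}" for i in range(1, 21)]
scores = [10.5 - i for i in range(20)]
top_set = {"g1", "g2", "g3", "g5"}      # hypothetical set sitting at the top
bottom_set = {"g18", "g19", "g20"}      # hypothetical set sitting at the bottom

es_top, nes_top, p_top = normalized_es(genes, scores, top_set)
es_bottom, nes_bottom, p_bottom = normalized_es(genes, scores, bottom_set)
print("top set:", es_top, nes_top, p_top)
print("bottom set:", es_bottom, nes_bottom, p_bottom)
```

The set at the top of the list gets a positive ES and NES, and the set at the bottom gets negative ones. With many gene sets you would still need the multiple-testing (FDR) adjustment from step 4, which this sketch leaves out.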

GSEA vs. DAVID: What’s the Difference?

GSEA differs from tools like DAVID, which uses gene lists filtered by arbitrary cutoffs (e.g., p-value < 0.05).

  • DAVID: Compares a subset of genes to curated databases and checks for overrepresentation.
  • GSEA: Uses all detected genes, avoiding cutoff bias and leveraging a ranked approach.

This distinction is crucial. Because GSEA analyzes the full ranked list, it can pick up a pathway whose genes all shift modestly in the same direction, even when few of them pass a significance cutoff on their own.
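
For contrast, here is a rough sketch of the cutoff-based over-representation idea. It uses a plain hypergeometric tail probability, which is a simplification I am assuming for illustration (DAVID itself reports a modified Fisher's exact test), and the function name and numbers are made up.

```python
from math import comb


def overrep_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn from the
    universe without replacement and n_pathway of them belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Tiny case that can be checked by hand: 10 genes, 4 in the pathway,
# 3 pass the cutoff, 2 of those are in the pathway.
p_small = overrep_pvalue(10, 4, 3, 2)

# A more realistic shape (hypothetical numbers): 20,000 detected genes,
# a 100-gene pathway, 500 genes passing the cutoff, 10 of them in the pathway.
p_realistic = overrep_pvalue(20000, 100, 500, 10)
print(p_small, p_realistic)
```

Notice that the result depends entirely on which genes made it past the cutoff (the 500 here). Move the cutoff and both the selected list and the overlap change, which is exactly the bias the ranked approach avoids.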

Pre-Ranked Gene List: When to Use It?

If you have fewer than three samples per condition, the default GSEA ranking method will not work. Instead, use GSEAPreranked, where you manually rank genes using:

sign(log2 fold change) × -log10(p-value)

This puts significantly up-regulated genes (positive fold change, low p-value) at the top of the list and significantly down-regulated genes at the bottom, making the enrichment test more meaningful.

Important Tip: Always provide all detected genes from your experiment—not just differentially expressed ones.
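
Here is a small sketch of building that pre-ranked list. I am assuming the metric is the sign of the log2 fold change times -log10(p-value); the gene names and numbers are invented, and the p-value floor (`min_p`) and the choice to skip genes with a missing p-value are my own assumptions for handling edge cases, not part of any tool.

```python
import math


def rank_metric(log2_fold_change, p_value, min_p=1e-300):
    """sign(log2FC) * -log10(p). min_p guards against log10(0) when a p-value underflows."""
    p = max(p_value, min_p)
    sign = (log2_fold_change > 0) - (log2_fold_change < 0)
    return sign * -math.log10(p)


# Hypothetical differential expression results: gene -> (log2 fold change, p-value).
# Keep ALL detected genes here, not only the significant ones.
results = {
    "GeneA": (2.1, 1e-8),
    "GeneB": (-1.7, 1e-6),
    "GeneC": (0.3, 0.4),
    "GeneD": (-0.2, 0.7),
    "GeneE": (3.0, 0.0),    # p-value reported as 0, floored to min_p
    "GeneF": (0.5, None),   # missing p-value, skipped in this sketch
}

ranked = sorted(
    (
        (gene, rank_metric(lfc, p))
        for gene, (lfc, p) in results.items()
        if p is not None
    ),
    key=lambda pair: pair[1],
    reverse=True,
)

# Two tab-separated columns (gene, score), the layout a .rnk file uses.
for gene, score in ranked:
    print(f"{gene}\t{score:.3f}")
```

Up-regulated genes with small p-values land at the top, down-regulated genes with small p-values at the bottom, and genes with weak evidence sit in the middle.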

Interpretation of the GSEA figure

We rank the genes by the sign of the fold change times -log10 of the p-value. We told DESeq2 to compare KO (knock-out) vs. WT (wild-type), so a positive fold change means the gene is up-regulated in KO.

So the genes at the top (or left) of the list are the genes with higher expression in the KO group, while the genes at the bottom (or right) are the genes with lower expression in the KO group.

How to read the figure?

The X-axis is all the genes in your experiment (~20,000 in this case), pre-ranked by your metric. Each black bar is a gene in this gene set (pathway), so you get an idea of where the set's genes are located in the pre-ranked list.

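The Enrichment Score (ES) is a running sum computed while walking down the pre-ranked list: it goes up each time you hit a gene in the gene set and down each time you hit a gene outside it, and the ES is the largest deviation from zero along the way. The ES is positive if the gene set is concentrated at the top of the pre-ranked gene list and negative if it is concentrated at the bottom.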

You see that most of the black bars representing the genes in this gene set (glycolysis) are at the front of the whole gene list.

The genes for the IL6 pathway are located at the end of the whole gene list.
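
If you want to see where those black bars come from, here is a toy sketch that finds the positions of a gene set in a ranked list and draws a crude text "barcode". The gene IDs and both gene sets are hypothetical stand-ins, not the actual glycolysis or IL6 genes, and the helper names are mine.

```python
def hit_positions(ranked_genes, gene_set):
    """0-based positions in the ranked list of the genes that belong to the set."""
    return [i for i, gene in enumerate(ranked_genes) if gene in gene_set]


def barcode(ranked_genes, gene_set, width=40):
    """Compress the ranked list into `width` text cells; '|' marks a cell with a hit."""
    n = len(ranked_genes)
    cells = [" "] * width
    for i in hit_positions(ranked_genes, gene_set):
        cells[i * width // n] = "|"
    return "".join(cells)


# 200 hypothetical genes, already sorted from most up-regulated to most down-regulated.
ranked_genes = [f"g{i}" for i in range(1, 201)]
front_set = {"g1", "g3", "g8", "g15", "g40"}          # behaves like a set at the front
end_set = {"g150", "g180", "g195", "g199", "g200"}    # behaves like a set at the end

print("[" + barcode(ranked_genes, front_set) + "]")
print("[" + barcode(ranked_genes, end_set) + "]")
```

Bars crowded on the left correspond to a positive ES, and bars crowded on the right correspond to a negative ES.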

Key Takeaways

  • GSEA ranks all genes, avoiding arbitrary cutoffs.
  • It assesses enrichment across the full dataset, not just a subset.
  • Pre-ranked GSEA is useful for datasets with few samples per condition.
  • Providing all detected genes ensures accurate results.

Want to dive deeper?
Read this great post by Mark Ziemann: Pathway Analysis with GSEA.

You can also read my old write-up here: https://github.com/crazyhottommy/RNA-seq-analysis/blob/master/GSEA_explained.md

Have you used GSEA in your research? What challenges have you faced? Hit reply and let me know.

Other posts from the last week you may find useful

  1. Human BioMolecular Atlas Program (HuBMAP): 3D Human Reference Atlas construction and usage | Nature Methods
  2. Stop losing track of your analysis files! Name folders with dates for better project organization. 🧵👇
  3. Tired of messy command-line output? Make it clean & readable with these simple tricks! 🧵👇
  4. Running out of disk space? Find out where your storage is going with these simple commands. 🧵👇
  5. Messy data slows you down. Here’s why a tidy dataframe is a game-changer in bioinformatics. 🧵👇
  6. Want to copy large files efficiently? Use rsync instead of cp.
  7. As a biologist, you work with a lot of data. But do you know how to use it to its fullest? Learning to program can change your research and career! Let's explore the top tips to get you started.
  8. To up-level your R skills, it is time to write an R package 🧵
  9. An old talk I gave: Single cell analysis: best practices and unsolved problems

Happy Learning!

Tommy aka crazyhottommy

PS:

If you want to learn Bioinformatics, there are three ways that I can help:

  1. My free Chatomics YouTube channel. Make sure you subscribe to it.
  2. I have many resources collected on my GitHub here.
  3. I have been writing blog posts for over 10 years https://divingintogeneticsandgenomics.com/

Stay awesome!

I am a bioinformatician/computational biologist with six years of wet lab experience and over 12 years of computational experience. I will help you learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter!