GSEA Demystified: How to Make Sense of Gene Set Enrichment Analysis

Published 3 months ago • 3 min read

Hello Bioinformatics lovers,

I post a long post on LinkedIn every day. I then pick the one with the most views and expand it into this newsletter.

It is surprising to me that the GSEA post gets the most impressions, although the method was published back in 2005.

I am a data scientist, and I use data to guide my decisions:)

So, let's learn more about it!

If you've ever worked with RNA-seq data, you've probably been asked to perform Gene Set Enrichment Analysis (GSEA).

It is one of the most widely used tools for pathway analysis (> 50k citations), yet it often feels like a black box.

If you are not patient, check out my Chatomics YouTube video for a step-by-step tutorial on GSEA with RNA-seq.

So, let’s break it down.

What is GSEA?

GSEA was developed by the Broad Institute to help interpret genome-wide expression data.

Unlike other pathway analysis tools, GSEA does not rely on arbitrary cutoffs for gene selection.

Instead, it ranks all genes in an experiment and checks if a predefined gene set is enriched at the top or bottom of the list.

Original paper: Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles

How GSEA Works

Rank all genes based on how well they separate your conditions (e.g., tumor vs. normal) using a metric.
Compare gene sets (from public databases) to this ranked list.
Compute an Enrichment Score (ES) to see if a gene set is enriched at the top (upregulated) or bottom (downregulated).
Normalize the ES (NES) and adjust for multiple testing (FDR).

A positive NES means the gene set is enriched at the top (upregulated genes).
A negative NES means the gene set is enriched at the bottom (downregulated genes).
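
To make the steps above concrete, here is a minimal Python sketch of a weighted running-sum enrichment score and a permutation-based NES. It is an illustration under stated assumptions, not the Broad Institute's implementation: the function names, the toy genes, and the use of random same-size gene sets as the null (in the spirit of pre-ranked mode) are all my own choices, and the FDR step across many gene sets is left out.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted running-sum ES over a list already sorted by descending score."""
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    # Hits step up in proportion to |score|**weight; misses step down equally.
    hit_steps = [abs(s) ** weight if hit else 0.0 for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_steps)
    if total_hit == 0:
        raise ValueError("all gene-set members have a score of zero")

    running, es = 0.0, 0.0
    for hit, step in zip(in_set, hit_steps):
        running += step / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(es):
            es = running  # keep the largest deviation from zero
    return es


def normalized_es(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Return (ES, NES, p-value); the null is random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    size = sum(gene in gene_set for gene in ranked_genes)
    null = [
        enrichment_score(ranked_genes, scores, set(rng.sample(ranked_genes, size)))
        for _ in range(n_perm)
    ]
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [abs(x) for x in null if (x >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan"), float("nan")
    nes = observed / (sum(same_sign) / len(same_sign))
    p_value = sum(x >= abs(observed) for x in same_sign) / len(same_sign)
    return observed, nes, p_value


# Toy ranked list: 20 hypothetical genes with scores 10..1 then -1..-10.
genes = [f"g{i}" for i in range(1, 21)]
scores = list(range(10, 0, -1)) + list(range(-1, -11, -1))

es_top, nes_top, p_top = normalized_es(genes, scores, {"g1", "g2", "g3"})
es_bottom, nes_bottom, p_bottom = normalized_es(genes, scores, {"g18", "g19", "g20"})
print(f"top set:    ES={es_top:.2f}  NES={nes_top:.2f}")
print(f"bottom set: ES={es_bottom:.2f}  NES={nes_bottom:.2f}")
```

The set sitting at the top of the list gets a positive ES and NES, and the set at the bottom gets negative ones, which matches the interpretation above.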

GSEA vs. DAVID: What’s the Difference?

GSEA differs from tools like DAVID, which uses gene lists filtered by arbitrary cutoffs (e.g., p-value < 0.05).

DAVID: Compares a subset of genes to curated databases and checks for overrepresentation.
GSEA: Uses all detected genes, avoiding cutoff bias and leveraging a ranked approach.

This distinction is crucial. Because it analyzes the full ranked list, GSEA can pick up pathways whose genes shift modestly but in the same direction, even when few of them would pass a significance cutoff on their own.
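
For contrast, here is a rough Python sketch of the over-representation idea behind cutoff-based tools. It uses a plain one-sided hypergeometric test, which is an assumption on my part and not DAVID's exact statistic; the function name and the toy gene lists are hypothetical.

```python
from math import comb


def ora_p_value(selected_genes, gene_set, universe):
    """One-sided hypergeometric p-value for over-representation."""
    universe = set(universe)
    selected = set(selected_genes) & universe
    members = set(gene_set) & universe
    overlap = len(selected & members)

    n, k, m = len(universe), len(members), len(selected)
    # P(X >= overlap) when drawing m genes without replacement from n.
    tail = sum(
        comb(k, x) * comb(n - k, m - x)
        for x in range(overlap, min(k, m) + 1)
    )
    return tail / comb(n, m)


# Hypothetical toy data: 100 detected genes and a 10-gene pathway.
universe = [f"g{i}" for i in range(100)]
pathway = universe[:10]

strict_cutoff = universe[:5] + universe[50:55]   # 10 genes pass, 5 in the pathway
loose_cutoff = universe[:6] + universe[50:84]    # 40 genes pass, 6 in the pathway

p_strict = ora_p_value(strict_cutoff, pathway, universe)
p_loose = ora_p_value(loose_cutoff, pathway, universe)
print(f"strict cutoff: p={p_strict:.4g}")
print(f"loose cutoff:  p={p_loose:.4g}")
```

The same pathway gets a different p-value depending on where the cutoff is drawn, which is the cutoff dependence that a ranked approach avoids.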

Pre-Ranked Gene List: When to Use It?

If you have fewer than three samples per condition, the default GSEA ranking method will not work. Instead, use GSEAPreranked, where you manually rank genes using:

Signed fold change × -log10(p-value)

This ensures that genes with high fold change and low p-values get ranked at the top, making the enrichment test more meaningful.

Important Tip: Always provide all detected genes from your experiment—not just differentially expressed ones.
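
Here is a small Python sketch of how such a ranking could be built from a differential expression table. I am reading the metric as sign(log2 fold change) × -log10(p-value); multiplying by the fold change itself would be another reading. The gene names, the flooring of zero p-values at `min_p`, and the choice to skip rows with no p-value are all assumptions for illustration.

```python
import math


def rank_metric(log2_fold_change, p_value, min_p=1e-300):
    """sign(log2FC) * -log10(p); p-values of 0 are floored at min_p."""
    sign = (log2_fold_change > 0) - (log2_fold_change < 0)
    return sign * -math.log10(max(p_value, min_p))


def build_ranked_list(de_results):
    """de_results: iterable of (gene, log2FC, p-value) for ALL detected genes."""
    scored = [
        (gene, rank_metric(lfc, p))
        for gene, lfc, p in de_results
        if lfc is not None and p is not None  # skip rows without a test result
    ]
    return sorted(scored, key=lambda row: row[1], reverse=True)


def to_rnk(ranked):
    """Tab-separated gene/score lines, the layout GSEAPreranked reads as .rnk."""
    return "\n".join(f"{gene}\t{score:.4f}" for gene, score in ranked)


# Hypothetical differential expression results (KO vs. WT).
de_results = [
    ("GeneA", 3.0, 1e-10),
    ("GeneB", -2.0, 1e-5),
    ("GeneC", 0.1, 0.5),
    ("GeneD", -0.2, 0.8),
    ("GeneE", 1.5, None),  # no p-value reported, so it is dropped here
]

ranked = build_ranked_list(de_results)
print(to_rnk(ranked))
```

Strongly up-regulated genes end up at the top, strongly down-regulated genes at the bottom, and genes with weak evidence sit in the middle.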

Interpretation of the GSEA figure

We rank the genes by the sign of the fold change times the -log10 of the p-value. We told DESeq2 to compare KO (knock-out) vs. WT (wild-type), so a positive fold change means the gene is up-regulated in KO.

So the genes at the top (or left) are the genes with higher expression in the KO group, while the genes at the bottom (or right) are the genes with lower expression in the KO group.

How to read the figure?

The X-axis is all the genes in your experiment (~20,000 in this case), pre-ranked by your metric. Each black bar is a gene in this gene set (pathway), so you get an idea of where the genes of the set are located in the pre-ranked list.

The Enrichment Score (ES) is a running-sum statistic: walking down the pre-ranked list, the score goes up each time a gene belongs to the gene set and goes down otherwise, and the ES is the largest deviation from zero. The ES is positive if the gene set is concentrated at the top of the pre-ranked gene list and negative if it is concentrated at the bottom.
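
If it helps to see where the black bars come from, here is a tiny Python sketch that marks the positions of gene-set members along a ranked list as a text "barcode". The function names, the bin width, and the toy genes are hypothetical; real GSEA plots draw one bar per gene, not per bin.

```python
def hit_positions(ranked_genes, gene_set):
    """Indices in the ranked list where a gene belongs to the gene set."""
    return [i for i, gene in enumerate(ranked_genes) if gene in gene_set]


def ascii_barcode(ranked_genes, gene_set, width=50):
    """Coarse text version of the black bars: '|' marks bins containing a hit."""
    n = len(ranked_genes)
    cells = ["."] * width
    for i in hit_positions(ranked_genes, gene_set):
        cells[i * width // n] = "|"
    return "".join(cells)


# Hypothetical ranked list of 1,000 genes, best-ranked first.
ranked_genes = [f"gene{i}" for i in range(1000)]
top_heavy_set = {"gene0", "gene5", "gene12", "gene30", "gene47", "gene80", "gene400"}

barcode = ascii_barcode(ranked_genes, top_heavy_set)
print(barcode)
```

A gene set whose bars cluster at the left end of the barcode is the kind that gets a positive ES; bars clustering at the right end give a negative ES.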

We see that glycolysis is positively correlated with `KO`.

You see that most of the black bars representing the genes in this gene set (glycolysis) are at the front of the whole gene list.

IL6_JAK_STAT3 signaling is positively correlated with `WT`.

The genes for the IL6 pathway are located at the end of the whole gene list.

Key Takeaways

GSEA ranks all genes, avoiding arbitrary cutoffs.
It assesses enrichment across the full dataset, not just a subset.
Pre-ranked GSEA is useful for datasets with few samples per condition.
Providing all detected genes, not just the differentially expressed ones, keeps the ranked list complete.

Want to dive deeper?
Read this great post by Mark Ziemann: Pathway Analysis with GSEA.

You can also read my old write-up here: https://github.com/crazyhottommy/RNA-seq-analysis/blob/master/GSEA_explained.md

Have you used GSEA in your research? What challenges have you faced? Hit reply and let me know.

Chatomics! — The Bioinformatics Newsletter