I am a bioinformatician/computational biologist with six years of wet lab experience and over 12 years of computation experience. I will help you to learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter!
Hello Bioinformatics lovers,

I write a long post on LinkedIn every day. I then pick the one with the most views and expand it into this newsletter. I was surprised that the GSEA post got the most impressions, even though the method was published over 10 years ago. I am a data scientist, and I use data to guide my decisions :) So, let's learn more about it!

If you've ever worked with RNA-seq data, you've probably been asked to perform Gene Set Enrichment Analysis (GSEA). It is one of the most widely used tools for pathway analysis (> 50k citations), yet it often feels like a black box. If you are impatient, check out my Chatomics YouTube video for a step-by-step tutorial on GSEA with RNA-seq data.

So, let’s break it down.

What is GSEA?

GSEA was developed by the Broad Institute to help interpret genome-wide expression data. Unlike many other pathway analysis tools, GSEA does not rely on arbitrary cutoffs for gene selection. Instead, it ranks all genes in an experiment and checks whether a predefined gene set is enriched at the top or bottom of the list.

Original paper: Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles

How GSEA Works
A positive NES (normalized enrichment score) means the gene set is enriched at the top of the ranked list (upregulated genes).

GSEA vs. DAVID: What’s the Difference?

GSEA differs from tools like DAVID, which work on gene lists filtered by arbitrary cutoffs (e.g., p-value < 0.05).
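To make the cutoff problem concrete, here is a minimal sketch of a cutoff-based over-representation test on made-up data. It uses a plain hypergeometric test as a stand-in for cutoff-based tools in general; DAVID itself uses a modified Fisher exact test (the EASE score), so this is not DAVID's exact calculation. The gene names, p-values, gene set, and function names are all hypothetical.

```python
from math import comb

def hypergeom_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random."""
    max_overlap = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / comb(n_universe, n_selected)

# Hypothetical differential-expression p-values for a 20-gene universe.
p_values = {f"g{i}": 0.1 + 0.04 * i for i in range(20)}
p_values.update({"g0": 0.001, "g1": 0.004, "g2": 0.03, "g3": 0.04,
                 "g4": 0.2, "g5": 0.002, "g6": 0.02})

# Hypothetical pathway containing five of those genes.
gene_set = {"g0", "g1", "g2", "g3", "g4"}

def ora_pvalue(cutoff):
    """Filter genes by a p-value cutoff, then test the overlap with gene_set."""
    selected = {gene for gene, p in p_values.items() if p < cutoff}
    overlap = len(selected & gene_set)
    return hypergeom_pvalue(len(p_values), len(gene_set), len(selected), overlap)

# Same data, same pathway, two different cutoffs.
print("cutoff 0.05:", round(ora_pvalue(0.05), 4))
print("cutoff 0.01:", round(ora_pvalue(0.01), 4))
```

In this toy example, the pathway passes a 0.05 significance threshold when genes are filtered at p < 0.05 but not when they are filtered at p < 0.01. The answer depends on where you draw the line, which is exactly what a rank-based method avoids.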
This distinction is crucial. By analyzing the full dataset, GSEA provides a more nuanced and statistically robust pathway analysis.

Pre-Ranked Gene List: When to Use It?

If you have fewer than three samples per condition, the default GSEA ranking method will not work. Instead, use GSEAPreranked, where you rank the genes yourself using:

Signed fold change × -log10(p-value)

This ensures that genes with high fold change and low p-values get ranked at the top, making the enrichment test more meaningful.

Important Tip: Always provide all detected genes from your experiment, not just the differentially expressed ones.

Interpretation of the GSEA figure

We rank the genes by the sign of the fold change times -log10(p-value). We told DESeq2 to compare KO (knock-out) vs. WT (wild-type), so a positive fold change means the gene is up-regulated in KO. The genes at the top (or left) of the list are therefore the genes with higher expression in KO.

How to read the figure?

The X-axis is all the genes in your experiment (~20,000 in this case), pre-ranked by your metric. Each black bar is a gene in this gene set (pathway), so you get an idea of where the genes of the set are located in the pre-ranked list.

The Enrichment Score (ES) is a running-sum statistic computed by walking down the ranked list. ES is positive if the gene set is concentrated at the top of the pre-ranked gene list, and negative if it is concentrated at the bottom.
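The ranking metric and the enrichment score can be sketched in a few lines of Python. This is a simplified illustration, not the Broad implementation: it reads "signed" as the sign of the log2 fold change, follows the weighted running-sum definition from the original GSEA paper (weight exponent `p = 1`), and skips the permutation step that produces the NES and p-values. The gene names, fold changes, p-values, and function names are made up.

```python
import math

def rank_metric(log2_fc, p_value, min_p=1e-300):
    """sign(log2 fold change) * -log10(p-value); min_p guards against log10(0)."""
    sign = (log2_fc > 0) - (log2_fc < 0)
    return sign * -math.log10(max(p_value, min_p))

def enrichment_score(ranked, gene_set, p=1.0):
    """Running-sum ES for a list of (gene, metric) sorted from high to low metric."""
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_total = sum(abs(m) ** p for (_, m), hit in zip(ranked, hits) if hit)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("need genes both inside and outside the set, with non-zero metrics")
    running, es = 0.0, 0.0
    for (_, metric), hit in zip(ranked, hits):
        if hit:
            running += abs(metric) ** p / hit_total  # step up at a gene in the set
        else:
            running -= 1.0 / n_miss                  # step down at any other gene
        if abs(running) > abs(es):
            es = running                             # keep the largest deviation from zero
    return es

# Hypothetical DESeq2-style results: gene -> (log2 fold change KO vs WT, p-value).
toy_results = {
    "G1": (2.5, 1e-8), "G2": (1.8, 1e-5), "G3": (1.1, 1e-3), "G4": (0.3, 0.40),
    "G5": (-0.2, 0.60), "G6": (-1.0, 1e-2), "G7": (-1.9, 1e-6), "G8": (-2.4, 1e-9),
}

# Rank ALL genes, not just the significant ones.
ranked = sorted(
    ((gene, rank_metric(fc, pval)) for gene, (fc, pval) in toy_results.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

es_top = enrichment_score(ranked, {"G1", "G2", "G3"})     # set sits at the top
es_bottom = enrichment_score(ranked, {"G6", "G7", "G8"})  # set sits at the bottom
print("ES (top set):", es_top)
print("ES (bottom set):", es_bottom)
```

A set whose genes cluster at the top of the list pushes the running sum up early, so its ES is positive. A set clustered at the bottom lets the running sum drift down first, so its ES is negative. These are the two shapes you see in the GSEA plot.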
You can see that most of the black bars representing the genes in this gene set (glycolysis) are at the front of the whole gene list.
The genes for the IL6 pathway are located at the end of the whole gene list.

Key Takeaways
Want to dive deeper? See my old write-up here: https://github.com/crazyhottommy/RNA-seq-analysis/blob/master/GSEA_explained.md

Have you used GSEA in your research? What challenges have you faced? Hit reply and let me know.

Other posts from the last week you may find useful
Happy Learning!

Tommy aka crazyhottommy

PS: If you want to learn Bioinformatics, there are four ways that I can help:
Stay awesome!