500 of your "significant" genes are fake

Published 2 months ago • 1 min read

Hello Bioinformatics lovers,

Tommy here. One of the key lessons I learned while doing RNA-seq analysis is multiple testing adjustment.

Under the null hypothesis, with no real difference, a p-value is a random variable which follows a uniform distribution.

This means about 5% of genes will have p < 0.05 just by chance, and about 1% will have p < 0.01, even when there are no real differences.

Run 10,000 genes where nothing changes and you collect about 500 false hits at p < 0.05 without a single real signal among them.
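You can check that number yourself. Here is a minimal sketch in base R that draws 10,000 p-values from a uniform distribution, which is what the null hypothesis gives you, and counts how many fall below 0.05. The seed and the variable names are my own choices for illustration.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
n_genes <- 10000

# Under the null hypothesis, p-values are uniform on [0, 1]
null_pvals <- runif(n_genes)

# Count the genes that look "significant" even though nothing is real
n_false_hits <- sum(null_pvals < 0.05)
n_false_hits                      # close to 0.05 * 10000 = 500, varies with the seed
```

Change the seed and the count moves around a little, but it stays near 500.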

Picture 10,000 jars of jellybeans. You taste one bean from each jar, hunting for your favorite flavor.

No jar contains it. You still walk away convinced that 500 jars did. You make the same mistake with a gene list.

Bonferroni: the strict fix. Divide 0.05 by 10,000 and call a gene real only below that tiny threshold. You block most false positives. You also discard most of your real discoveries. GWAS needs that severity. You cannot afford it in a typical expression study.
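A quick sketch of the Bonferroni arithmetic on simulated null p-values. The data are made up. p.adjust() is base R, and its "bonferroni" method multiplies each p-value by the number of tests and caps the result at 1.

```r
set.seed(1)                       # arbitrary seed
n_genes <- 10000
null_pvals <- runif(n_genes)      # simulated genes with no real effect

# The strict threshold: 0.05 divided by the number of tests
bonf_cutoff <- 0.05 / n_genes     # 5e-06
sum(null_pvals < bonf_cutoff)     # false hits that survive, usually none

# Same idea from the other side: inflate the p-values, keep the 0.05 cutoff
bonf_adj <- p.adjust(null_pvals, method = "bonferroni")
sum(bonf_adj < 0.05)
```

A real gene has to reach p < 0.000005 to pass, which is why so many true discoveries get lost.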

FDR: the smarter fix. The False Discovery Rate caps the fraction of bad calls among your hits rather than blocking each one. Set FDR to 5%. Report 100 genes and you expect about 5 to be wrong. You keep your discoveries and drop most of the junk.
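To see what that looks like, here is a small simulation. Everything in it is an assumption for illustration: 9,000 null genes with uniform p-values, and 1,000 "real" genes whose p-values I draw from a Beta distribution skewed toward zero. Because the truth is known, you can count how many hits are false.

```r
set.seed(7)                                  # arbitrary seed
n_null <- 9000                               # hypothetical genes with no effect
n_real <- 1000                               # hypothetical genes with a real effect

null_p <- runif(n_null)                      # uniform under the null
real_p <- rbeta(n_real, 0.1, 1)              # assumed shape: piled up near zero
pvals   <- c(null_p, real_p)
is_null <- rep(c(TRUE, FALSE), c(n_null, n_real))

# Naive cutoff: count the null genes that sneak in at p < 0.05
naive_false <- sum(pvals < 0.05 & is_null)

# Benjamini-Hochberg at FDR 5%
padj <- p.adjust(pvals, method = "BH")
hits <- padj < 0.05

# Observed fraction of false calls among the BH hits
fdp <- sum(hits & is_null) / max(1, sum(hits))
c(naive_false = naive_false, bh_hits = sum(hits), bh_false_fraction = fdp)
```

The false fraction in a single run will not be exactly 5%. FDR is a guarantee on average, not for every list.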

The q-value: per gene. FDR describes a whole list. The q-value applies to one gene. A q-value of 0.01 says that to include this gene, you accept a list where 1% of the hits are false. You report that number instead of a bare “p < 0.05.”
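Where does a per-gene number come from? A sketch of the Benjamini-Hochberg version, computed by hand on six made-up p-values and checked against base R's p.adjust(). Each gene's adjusted value is the smallest FDR level at which that gene would still make the list.

```r
pvals <- c(0.0001, 0.004, 0.019, 0.03, 0.2, 0.6)   # made-up p-values
m <- length(pvals)

# Walk from the largest p-value down to the smallest
ord <- order(pvals, decreasing = TRUE)
raw <- pvals[ord] * m / (m:1)        # p * (number of tests) / rank

# Running minimum keeps the adjusted values monotone, capped at 1
adj_sorted <- pmin(1, cummin(raw))

# Put the values back in the original gene order
bh_manual <- numeric(m)
bh_manual[ord] <- adj_sorted

all.equal(bh_manual, p.adjust(pvals, method = "BH"))
```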

In R, one line each:

    p.adjust(pvals, method = "fdr")   # Benjamini-Hochberg
    qvalue::qvalue(pvals)$qvalues     # Storey's q-value, less conservative

Both control false discoveries. The second estimates the share of your genes with no real effect, so it recovers a few more real hits.
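Here is the idea behind that estimate, as a simplified sketch in base R and not the qvalue package's actual code. Large p-values come almost entirely from null genes, so the fraction above a cutoff lambda tells you roughly how many nulls there are. I fix lambda at 0.5 as an assumption; as far as I know the package by default looks across a range of lambda values instead. The simulated mixture is made up.

```r
set.seed(7)                                      # arbitrary seed
# Hypothetical mixture: 9,000 null genes, 1,000 genes with a real effect
pvals <- c(runif(9000), rbeta(1000, 0.1, 1))

lambda <- 0.5                                    # fixed cutoff, an assumption
# Share of null genes, estimated from the flat right half of the p-value histogram
pi0_hat <- min(1, mean(pvals > lambda) / (1 - lambda))

bh <- p.adjust(pvals, method = "BH")             # BH behaves as if pi0 were 1
q_sketch <- pi0_hat * bh                         # scale down by the estimated null share

c(pi0_hat = pi0_hat,
  bh_hits = sum(bh < 0.05),
  q_hits  = sum(q_sketch < 0.05))
```

Since pi0_hat is at most 1, the q-value list is never shorter than the BH list at the same cutoff. For real analyses, use qvalue::qvalue() rather than this sketch.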

Correct for multiple testing and your reviewers trust your gene list while your collaborators stop chasing ghosts.

For the full breakdown, read my blog post: https://divingintogeneticsandgenomics.com/post/understanding-p-value-multiple-comparisons-fdr-and-q-value/

qvalue Bioconductor package tutorial: https://www.bioconductor.org/packages/release/bioc/html/qvalue.html

Happy Learning,

Tommy aka crazyhottommy

PS:

If you want to learn Bioinformatics, there are four ways that I can help:

My free YouTube channel, Chatomics. Make sure you subscribe to it.
I have collected many resources on my GitHub.
I have been writing blog posts for over 10 years: https://divingintogeneticsandgenomics.com/
Lastly, I post daily on LinkedIn, where you may find useful posts: https://www.linkedin.com/in/%F0%9F%8E%AF-ming-tommy-tang-40650014/recent-activity/all/

Stay awesome!


Chatomics! — The Bioinformatics Newsletter
