Hi! I'm Tommy Tang

Python or R for bioinformatics?


Hello Bioinformatics lovers,

I wish you a blissful Saturday! If you need to pick Python or R for bioinformatics, which one should you choose?

This is my story.

I started learning Unix commands 12 years ago (see an example of how powerful Unix commands can be). I then picked up Python using "Python for absolute beginners". It was a great book, however...

I did my first print("hello world") and learned the syntax of the language, but that was not much help for the practical bioinformatics problems I had. Why?

I had a lot of tables/spreadsheets to analyze at that time, and learning pandas in Python was just not that smooth. Then I found R.

R is perfect for rectangular tables and has built-in support for data frames. With the more recent tidyverse, it is much easier to do complex data wrangling in a few lines of code. Figures made with ggplot2 are still better than those made in Python.

Moreover, R has the Bioconductor ecosystem which contains thousands of packages for bioinformatics. You can simply search "xxx data analysis, Bioconductor" and find packages.

For example, bulk RNAseq analysis still widely uses the DESeq2 package.

Python is a general-purpose programming language and is strong in deep learning, but it may have fewer pre-built bioinformatics packages. Some people find it more intuitive to learn, while others think R is easier. R has its own problems too.

R typically loads all the data into memory (in Python, you can read a file and process it line by line). So when the data is big, you need a lot of RAM to do the analysis.
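Here is a minimal sketch of the line-by-line approach in Python. The function name `mean_per_column` and the toy table are made up for illustration, and it assumes a tab-delimited file whose first column is a row ID and whose remaining columns are numeric.

```python
import io


def mean_per_column(lines, sep="\t"):
    """Stream a delimited table and return the mean of each numeric column.

    Only the current line and one running sum per column are kept in memory.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n").split(sep)
    sample_names = header[1:]  # assumption: the first column is a row ID
    sums = [0.0] * len(sample_names)
    n_rows = 0
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        for i, value in enumerate(fields[1:]):
            sums[i] += float(value)
        n_rows += 1
    if n_rows == 0:
        return {}
    return {name: total / n_rows for name, total in zip(sample_names, sums)}


# Toy stand-in for a big file (made-up values).
# With real data you would pass an open file handle instead.
toy_table = io.StringIO(
    "site\tsample1\tsample2\n"
    "chr1:100\t0.5\t1.0\n"
    "chr1:200\t0.25\t0.0\n"
)
column_means = mean_per_column(toy_table)
print(column_means)
```

Because the function only needs something it can iterate over, the same code should work with `open("my_table.tsv")` (a hypothetical file name), and memory use stays roughly constant no matter how many rows the file has.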

For whole-genome bisulfite sequencing data, you can have 20 million rows x 50 samples. It is not practical to load all of that into memory.

The more recent DelayedArray package addresses this problem, though.

Python and R both have their own pros and cons. If you can, learn both and use the one that suits the task at hand.

So if you are a biologist who wants to do a quick analysis with a pre-built package, I think R is better. But if you want to be a serious programmer (developing algorithms), learn Python too.

Read this blog post on how to solve the same data formatting problem using Python and R:

https://divingintogeneticsandgenomics.com/post/how-to-separate-a-comma-delimited-string-into-multiple-lines-in-r-and-python/
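To give a flavor of that kind of task, here is a small Python sketch that splits a comma-delimited field into one row per value. It is not the code from the blog post: the function name `split_rows`, the column names, and the toy records are all made up for illustration, and it uses only the standard library.

```python
def split_rows(rows, column, sep=","):
    """Yield one output row per value found in a delimited field."""
    for row in rows:
        for value in row[column].split(sep):
            new_row = dict(row)  # copy so the input row is left unchanged
            new_row[column] = value.strip()
            yield new_row


# Toy records (made up): one gene can carry several comma-separated labels.
records = [
    {"gene": "geneA", "labels": "x, y"},
    {"gene": "geneB", "labels": "z"},
]
long_records = list(split_rows(records, "labels"))
for record in long_records:
    print(record["gene"], record["labels"], sep="\t")
```

The generator keeps the other columns as they are and repeats them for every value, which is usually what you want when turning a "wide" annotation column into a "long" table.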

Happy Learning!

Tommy

PS: If you want to learn bioinformatics and do not know where to start, my book "From Cell Line to Command Line" can help you.

Let's connect on Twitter and LinkedIn!

I am a bioinformatician/computational biologist with six years of wet lab experience and over 12 years of computational experience. I will help you learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter!
