
Hi! I'm Tommy Tang

My 2 cents on coding when I was a bioinformatics beginner


Hello Bioinformatics lovers,

First of all, welcome to all the new subscribers. I hope my writing can help you advance your bioinformatics skills. Also, please forward it to your friends if you think this newsletter can help them too!

I wrote this post in 2015, and I want to re-share it with you.


I was trained as a wet-lab biologist, and I started learning to code in April 2012 with my first-ever Python book: Python Programming for Absolute Beginners. I still remember the days when, after work, I would sit down in front of the computer and go through the book until 10 pm every day.


It was not that practical in terms of translating what I had learned into what I wanted to analyze in the lab, but still, I had entered a new world! Watch this podcast interview if you want to learn more about my journey.


In the fall semester of 2012, I took a beginner bioinformatics course at the University of Florida that used Practical Computing for Biologists as a reference book.

It is a great book, and it taught me regular expressions, Unix commands, and some Python directly related to biology.

I was deeply attracted by the beauty of code and was surprised and satisfied by how useful learning to code can be.


Lessons I learned from that class: Regular expressions are extremely useful! You need to know at least the basics; after that, you can always google for solutions.
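
To give a flavor of what the basics can already do, here is a minimal Python sketch. The header string, the DNA sequence, and the function names are all made up for this illustration; they do not come from the course or the book.

```python
import re

# Matches a coordinate header such as ">chr1:1000-2000 some description".
# Named groups make the captured pieces easy to read back.
REGION_PATTERN = re.compile(r"^>(?P<chrom>\w+):(?P<start>\d+)-(?P<end>\d+)")


def parse_region(header):
    """Return (chrom, start, end) from a header line, or None if it does not match."""
    match = REGION_PATTERN.search(header)
    if match is None:
        return None
    return match.group("chrom"), int(match.group("start")), int(match.group("end"))


def find_motif(sequence, motif="GAATTC"):
    """Return the 0-based start of every non-overlapping motif match.

    The default motif is the EcoRI recognition site; any regular
    expression can be passed instead.
    """
    return [m.start() for m in re.finditer(motif, sequence)]


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(parse_region(">chr1:1000-2000 gene=TP53"))
    print(find_motif("AAGAATTCTTGAATTCA"))
```

Two short patterns replace what would otherwise be a lot of manual string slicing, and the same patterns carry over almost unchanged to grep, sed, and R.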


Bioinformatics is a field that evolves so fast that many tools you use may become obsolete tomorrow.

However, Unix skills will never fade. I urge every wet-lab biologist like me to learn Unix commands first.

It will take time for you to become fluent in the terminal. It took me 2 years to feel really comfortable working in the terminal, so stop worrying and take your time.


The statistical programming language R is very popular in the bioinformatics field. I started using R because I could take advantage of the rich packages in Bioconductor.

I started from the basics with The Art of R Programming. After getting the basics, learning to use packages like dplyr and ggplot2 will greatly reduce the complexity of your code and enhance your productivity.

Surprisingly, all these awesome packages were developed by the same person: Hadley Wickham.

Learn some Git. Git is a version control system that tracks changes to your code. I am still a beginner, but I have realized how important it is to version control my code.

For this reason, I have a GitHub repo where I put my code. I am still learning Git every day.


When a project grows big, you need to manage it well. There are several resources that I recommend reading before starting any project:
1. A Quick Guide to Organizing Computational Biology Projects

2. Designing project by Vince Buffalo. Vince Buffalo has a book that I highly recommend to everyone: Bioinformatics Data Skills. It covers many of the points that I want to make in this post. I might write a review of it after finishing all the chapters.


3. Best Practices for Scientific Computing

The take-home message for me is that it is not enough to just run the code, get some results, and then publish them.


One needs to be aware that:

1. Computers make mistakes. They can give you nonsense results and exit without an error, so test your code extensively before you rely on it.


2. Share your code. Even if your code is correct, you need to share it so that other people can look at it and perhaps improve it.


3. Make your code reusable. Do not hard-code values in your scripts. If a script takes a file path as input, make it an argument of the script.


4. Modularize your scripts. Data can arrive at different stages of processing. Take ChIP-sequencing data analysis as an example: suppose you have a script that processes the data from fastq all the way to the final peaks. You may want to split it into two modules: one for mapping fastq to bam, and the other for going from bam to peaks. That way, people can still use your script when their data come in bam format.


5. Comment your scripts heavily. It will not only help other people understand your code better, but also help the future you understand what you did.

6. Make your analysis reproducible. Each step of your analysis should be documented in a markdown file. I say every step: yes, every command that you type in the terminal to generate the intermediate files needs to be written down. Moreover, how, when, and where you downloaded the data needs to be documented. This will save the future you! Many experienced programmers overlook this point.
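
Points 3 to 5 can be sketched together in a few lines of Python. This is only a skeleton under stated assumptions: the two step functions are placeholders that work out output file names and do not run an aligner or a peak caller, and every function and file name here is hypothetical.

```python
import argparse
from pathlib import Path


def map_fastq_to_bam(fastq_path):
    """Module 1 (placeholder): fastq -> bam.

    A real version would run an aligner here. This sketch only derives
    the output file name so that the control flow can be shown.
    """
    path = Path(fastq_path)
    return path.with_name(path.stem + ".bam")


def call_peaks_from_bam(bam_path):
    """Module 2 (placeholder): bam -> peaks.

    A real version would run a peak caller here.
    """
    path = Path(bam_path)
    return path.with_name(path.stem + ".peaks.bed")


def run_pipeline(input_path):
    """Run only the modules that the input still needs.

    Returns the files the pipeline would produce, in order.
    """
    path = Path(input_path)
    produced = []
    if path.suffix in {".fastq", ".fq"}:
        # Raw reads: start from the mapping module.
        path = map_fastq_to_bam(path)
        produced.append(path)
    if path.suffix != ".bam":
        raise ValueError(f"expected a fastq or bam file, got: {input_path}")
    # Aligned reads: only the peak-calling module is needed.
    produced.append(call_peaks_from_bam(path))
    return produced


def main(argv=None):
    # The input file is a command-line argument, not a hard-coded path.
    parser = argparse.ArgumentParser(description="Toy ChIP-seq pipeline skeleton")
    parser.add_argument("input", help="path to a .fastq/.fq or .bam file")
    args = parser.parse_args(argv)
    for output in run_pipeline(args.input):
        print(output)


# Demo call with an explicit argument list. In a real script, call main()
# with no arguments so that argparse reads the command line instead.
main(["sample.fastq"])
```

Because the two modules are separate functions, someone who already has a bam file can pass it in and skip the mapping step. Small functions like these are also easy to check with a few assertions before you trust them on real data, which is point 1 again.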

Happy Learning!

Tommy, aka Crazyhottommy

PS: I expanded my last newsletter on genomic intervals into a blog post with practical examples and made a YouTube video on it.

PPS:

If you want to learn Bioinformatics, there are four ways that I can help:

  1. My free YouTube channel, Chatomics. Make sure you subscribe to it.
  2. I have many resources collected on my GitHub here.
  3. I have been writing blog posts for over 10 years at https://divingintogeneticsandgenomics.com/
  4. Lastly, I have a book called "From Cell Line to Command Line" to teach you bioinformatics.

Stay awesome!


I am a computational biologist with six years of wet lab experience and over ten years of computational experience. I will help you learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter! https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources
