
Hi! I'm Tommy Tang

What's the most undervalued skill in computational biology?


Hello Bioinformatics lovers,

What's the most undervalued skill in computational biology?

It is EDA (Exploratory Data Analysis).

What is EDA? According to Wikipedia:

In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

In plain words, it means using summary statistics and visualization (figures) to understand your data BEFORE you do any formal data analysis. Why is it important? Here are four things it lets you check:

  1. Inspect the variable names (column names). You can use the head() or tail() function in R, or head and tail on the command line, to look at the first or last several rows.
  2. Use the table() function to see which distinct values a column contains and how often each one occurs. If you have a dataframe (df) with a column named "gender", and table(df$gender) shows "F, M, female, male", you know you need to fix that column first. Always use table(df$gender, useNA="ifany") so that you know whether a column has missing values.
  3. Make histograms to check the distribution of a column and get a sense of its range. Use summary statistics such as quantile() to see some actual numbers. If you have outliers and you map numbers to colors (as in a heatmap), you will need to cap the values, or the outliers will throw off your heatmap color scale. Watch this video to understand heatmaps.
  4. Use PCA (principal component analysis; see an example) to check for outlier samples, sample swaps, or batch effects. If control and treatment samples cluster together instead of separating by condition, it could be that they were processed on the same day.
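
Here is a minimal sketch of steps 1 to 3 in base R. The data frame df, its column names, the label mapping, and the choice of the 95th percentile as the cap are all made up for illustration; pick what makes sense for your own data.

```r
# Hypothetical toy data: 100 samples with messy gender labels and one outlier.
set.seed(1)
n <- 100
df <- data.frame(
  sample_id  = paste0("S", seq_len(n)),
  gender     = rep(c("F", "M", "female", "male", NA), times = 20),
  expression = c(rnorm(n - 1, mean = 3, sd = 0.5), 250),  # last value is an outlier
  stringsAsFactors = FALSE
)

# Step 1: look at the column names and the first/last rows.
colnames(df)
head(df, 3)
tail(df, 3)

# Step 2: count the distinct values, including missing ones.
gender_counts <- table(df$gender, useNA = "ifany")
print(gender_counts)

# Harmonize the labels (this mapping is an assumption about the intended coding).
df$gender_clean <- ifelse(df$gender %in% c("F", "female"), "F",
                   ifelse(df$gender %in% c("M", "male"), "M", NA_character_))
print(table(df$gender_clean, useNA = "ifany"))

# Step 3: check the distribution and look at some quantiles.
hist(df$expression, breaks = 30, main = "expression", xlab = "value")
print(quantile(df$expression, probs = c(0, 0.25, 0.5, 0.75, 0.95, 1)))

# Cap values at the 95th percentile so one outlier does not dominate
# a color scale (the 0.95 cutoff is an arbitrary choice for this sketch).
cap <- unname(quantile(df$expression, probs = 0.95))
df$expression_capped <- pmin(df$expression, cap)
print(range(df$expression_capped))
```

After capping, the range of expression_capped stays close to the bulk of the data, so a heatmap color scale built on it is no longer stretched by the single value of 250.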

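For the PCA check in step 4 above, here is a rough sketch using prcomp() from base R on simulated data. The sample sheet (meta), the simulated matrix (expr), and the size of the batch shift are all hypothetical. I center the genes but do not scale them, which assumes the values are already on a comparable (for example, log) scale.

```r
# Hypothetical design: 6 samples, conditions alternate, processed on two days.
set.seed(42)
n_genes <- 200
meta <- data.frame(
  sample    = paste0("S", 1:6),
  condition = rep(c("control", "treatment"), times = 3),
  batch     = rep(c("day1", "day2"), each = 3),
  stringsAsFactors = FALSE
)

# Simulated genes x samples matrix (pure noise, no real condition effect).
expr <- matrix(
  rnorm(n_genes * nrow(meta)),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), meta$sample)
)

# Simulate a batch effect: shift 50 genes upward in the day2 samples.
is_day2 <- meta$batch == "day2"
expr[1:50, is_day2] <- expr[1:50, is_day2] + 5

# prcomp() expects samples in rows, so transpose the matrix.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Color by batch, shape by condition, and label each point.
plot(
  pca$x[, 1], pca$x[, 2],
  col  = ifelse(meta$batch == "day1", "blue", "red"),
  pch  = ifelse(meta$condition == "control", 16, 17),
  xlab = sprintf("PC1 (%.0f%%)", 100 * var_explained[1]),
  ylab = sprintf("PC2 (%.0f%%)", 100 * var_explained[2])
)
text(pca$x[, 1], pca$x[, 2], labels = meta$sample, pos = 3)
```

In this simulation the samples split along PC1 by processing day, with control and treatment mixed inside each group. That is the pattern to look for when you suspect a batch effect.
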
I call this "data intuition". Through EDA and looking at the data with your own eyes (by printing out some rows), you should be able to spot irregularities in your data.

This means that you should develop a sense for when something is wrong during the data analysis. The reality is:

If you feel something is off, there is usually something wrong with your code or your understanding.

Instead of blindly following any tutorial or executing any command, make sure it makes sense by giving the output a second thought. Here is an example.

If you have sequenced female samples and you find variant calls (mutations) on the Y chromosome, you know something is wrong.
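
A sanity check like this is easy to automate. Below is a small sketch in base R; the variants and samples tables, their column names, and the chromosome labels ("chrY" or "Y") are hypothetical and will differ in your own pipeline.

```r
# Hypothetical variant calls and sample sheet; all values are made up.
variants <- data.frame(
  sample = c("S1", "S1", "S2", "S2", "S3"),
  chrom  = c("chr1", "chrX", "chr2", "chrY", "chrY"),
  pos    = c(12345, 67890, 13579, 24680, 11223),
  stringsAsFactors = FALSE
)
samples <- data.frame(
  sample = c("S1", "S2", "S3"),
  sex    = c("F", "F", "M"),
  stringsAsFactors = FALSE
)

# Attach the recorded sex to each variant call.
calls <- merge(variants, samples, by = "sample")

# Flag Y-chromosome calls in samples recorded as female.
suspicious <- calls[calls$sex == "F" & calls$chrom %in% c("chrY", "Y"), ]

if (nrow(suspicious) > 0) {
  message(
    "chrY variant calls found in female samples: ",
    paste(unique(suspicious$sample), collapse = ", ")
  )
}
```

Here S2 is recorded as female but has a call on chrY, so it gets flagged. Whether the cause is a sample swap, a wrong label in the sample sheet, or something in the pipeline is what you then need to figure out.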

Do not blindly follow any tutorial; make sure you understand the details. That's all for today.

Happy Learning!

Tommy aka. Crazyhottommy

PS:

If you want to learn Bioinformatics, there are four ways that I can help:

  1. My free YouTube channel, Chatomics. Make sure you subscribe to it.
  2. I have many resources collected on my GitHub here.
  3. I have been writing blog posts for over 10 years: https://divingintogeneticsandgenomics.com/
  4. Lastly, I have a book called "From Cell Line to Command Line" to teach you bioinformatics.

Stay awesome!


I am a bioinformatician/computational biologist with six years of wet lab experience and over 12 years of computational experience. I will help you learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter!
