
Hi! I'm Tommy Tang

Develop a data sense


Hello Bioinformatics lovers,

It is a beautiful day here. I had a dilemma: should I sit and write this newsletter or go out and play with the kids?

I chose the latter. That's why it is a little late when you get this.

I love teaching and I also love my family.

Today we will talk about how to develop a data sense. What do I mean?

You should verify whatever you get from a command by:

  1. using some intuition
  2. looking at the data with your own eyes
  3. trying a simpler example that you can understand

I have talked about exploratory data analysis (EDA) before. Let me give you another example.

I have an RNAseq count matrix with rows of genes and columns of samples. I need to normalize it to counts per million (CPM).

The calculation is simple:

  • get the sum of each column (one sum per sample)
  • divide each column by its column sum
  • multiply the result by 10^6 (a million)

In R, you can do it by: t(t(mat)/colSums(mat)) * 1e6

The key is transposing the matrix first, dividing by the column sums, and then transposing it back.
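As a quick sanity check of the transpose trick, here is a minimal sketch on a made-up matrix with 3 genes and 2 samples. The counts and the names `mat` and `cpm` are only for illustration. It includes the multiplication by 10^6 from the steps above. If the calculation is right, every column of the result should add up to one million.

```r
# Toy count matrix: 3 genes (rows) x 2 samples (columns); values are made up
mat <- matrix(c(10, 20, 30,
                40, 50, 60),
              nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("sample1", "sample2")))

# Divide each column by its column sum, then scale to a million
cpm <- t(t(mat) / colSums(mat)) * 1e6

# Intuition check: each sample's CPM values should add up to 1e6
colSums(cpm)
```

If a column does not sum to a million, the division went along the wrong dimension.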

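By default, when you divide a matrix by a vector in R, the vector is recycled down each column, so element i of the vector divides row i (as long as the vector has one value per row). In other words, the division is row-wise.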
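You can see this behavior on a tiny example. This is a minimal sketch with made-up numbers; `mat`, `v`, `res` and `by_hand` are just illustrative names.

```r
# Dummy matrix: 3 rows x 2 columns (R fills matrices column by column)
mat <- matrix(c(10, 20, 30,
                40, 50, 60), nrow = 3)

# A vector with one value per row
v <- c(10, 20, 30)

# v is recycled down each column, so row 1 is divided by 10,
# row 2 by 20, and row 3 by 30
res <- mat / v
res

# The same result written out row by row
by_hand <- rbind(mat[1, ] / 10,
                 mat[2, ] / 20,
                 mat[3, ] / 30)
all.equal(res, by_hand)   # should be TRUE
```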

If you want to further normalize it to RPKM (reads per kilobase per million), you need to divide each row (gene) by the length of that gene in kilobases. Suppose you have another vector, gene_length, with the gene lengths in kilobases, in the same order as the matrix rows (if your lengths are in base pairs, divide them by 1000 first):

t(t(mat)/colSums(mat)) * 1e6 / gene_length

In this case, you can just divide the matrix by the gene_length vector, because the division is row-wise.
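Here is a minimal sketch of the RPKM step on a made-up matrix. It assumes the gene lengths are already in kilobases and includes the 10^6 scaling; the lengths and counts are hypothetical. One value is recomputed by hand to confirm that each gene is divided by its own length.

```r
# Toy count matrix: 3 genes (rows) x 2 samples (columns); values are made up
mat <- matrix(c(10, 20, 30,
                40, 50, 60), nrow = 3)

# Hypothetical gene lengths in kilobases, in the same order as the rows
gene_length <- c(1, 2, 0.5)

# Counts per million, then divide each row by its gene length
cpm  <- t(t(mat) / colSums(mat)) * 1e6
rpkm <- cpm / gene_length

# Check gene 2 in sample 1 by hand: count 20, sample total 60, length 2 kb
by_hand <- (20 / 60) * 1e6 / 2
all.equal(rpkm[2, 1], by_hand)   # should be TRUE
```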

However, you may not know that the division is by rows and may assume it is by columns. Then you may make a mistake!

Instead, you can use a dummy matrix with only 3 rows, so you can see the calculation with your own eyes!
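A minimal sketch of such a dummy matrix is below, with made-up counts and illustrative variable names. It compares the tempting direct division with the transpose approach, then cross-checks against base R's `sweep()`, which is one alternative way to divide by columns.

```r
# Dummy count matrix: 3 genes (rows) x 2 samples (columns), made-up values
mat <- matrix(c(10, 20, 30,
                40, 50, 60), nrow = 3)
col_totals <- colSums(mat)   # 60 and 150

# Tempting but wrong: the 2 column sums are recycled element by element
# down the columns, and R gives no warning because 6 is a multiple of 2
wrong <- mat / col_totals

# Transpose, divide, transpose back: each column is divided by its own sum
right <- t(t(mat) / col_totals)

colSums(wrong)   # not all 1, so something is off
colSums(right)   # each column sums to 1, as proportions should

# Cross-check with sweep(), which names the margin explicitly (2 = columns)
all.equal(right, sweep(mat, 2, col_totals, "/"))   # should be TRUE
```

With only 3 rows you can recompute every cell by hand, which is the whole point of the exercise.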

That's exactly what I did. Watch this YouTube video.

The key takeaway: always verify your results.

Happy Learning!

Tommy

PS:

If you want to learn Bioinformatics, there are three ways that I can help:

  1. My free YouTube channel, Chatomics. Make sure you subscribe to it.
  2. The many resources I have collected on my GitHub here.
  3. My blog, where I have been writing posts for over 10 years: https://divingintogeneticsandgenomics.com/

Stay awesome!


I am a computational biologist with six years of wet lab experience and over ten years of computational experience. I will help you learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter! https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources
