Up-level your bioinformatics skills: A leap in thinking about repetition

Published 7 months ago • 2 min read

Hello Bioinformatics lovers,

Happy Thanksgiving! I promise you will learn something about bioinformatics in the end :)

But I want to ask this first: What are you thankful for?

I am grateful for all the people who have supported me in my career;

and also for the setbacks and failures that have taught me invaluable lessons.

Most importantly, I am thankful to YOU, the reader of this newsletter.

I started this newsletter to share the hard-learned experience and tips of my bioinformatics learning journey.

It is you who motivates me to write this newsletter EVERY Saturday consistently. The responses below make my day.

So, thank you, all the subscribers!

Now, let's dive into today's topic: dealing with repetition in bioinformatics.

What do I mean by repetition?

First, a simple dummy example.
You have a list of numbers 1, 2, 3, 4 and you want to multiply each element by 2.

In Python, you can define a function and apply it to every element of a list using any of the following:

a for loop
the map function
a list comprehension

Different ways to apply a function to every element of a list in Python
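
In case the image does not load, here is a minimal sketch of the three approaches. The function name times_two is just a name I picked for illustration.

```python
def times_two(x):
    """Return the input multiplied by 2."""
    return x * 2


numbers = [1, 2, 3, 4]

# 1. A for loop: build the result list one element at a time.
doubled_loop = []
for n in numbers:
    doubled_loop.append(times_two(n))

# 2. The built-in map function: it returns a lazy iterator,
#    so wrap it in list() to get the values.
doubled_map = list(map(times_two, numbers))

# 3. A list comprehension.
doubled_comp = [times_two(n) for n in numbers]

print(doubled_loop)
print(doubled_map)
print(doubled_comp)
```

All three give the same result. Which one you pick is mostly a matter of taste and readability.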

Let me give you a more concrete/useful example.

In its simplest form: you perform an operation/analysis for one sample, and now you need to do the same thing for many samples.

For a single-sample single-cell RNAseq analysis, you can easily run the Seurat/Scanpy workflow on it.

The tutorials are usually easy to follow. But in real-life bioinformatics, the story is different.

You may encounter many samples in separate count matrices in GEO.

*An example of count matrices from a GEO dataset*

The idea is to write a function to read in one sample, and then apply it to all samples.

Create a function in R to read in one count matrix and then apply it to all samples

We define the read_counts function and then use purrr::map() to apply it to all samples.

The full script is here, and the YouTube video is Creating a Seurat Object from a GEO Dataset.
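
If you work in Python instead, the same "one function, many samples" pattern could look roughly like the sketch below. It is not a translation of the R script. The Python read_counts here only borrows the name, read_all_counts is a helper I made up, and the file layout (tab-separated, a header row, gene name in the first column, one column of integer counts per cell) is an assumption for illustration.

```python
import csv
from pathlib import Path


def read_counts(path):
    """Read one tab-separated count matrix into {gene: [counts per cell]}.

    Assumed layout: a header row, then one row per gene with the gene
    name in the first column and integer counts in the remaining columns.
    """
    counts = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header row (cell barcodes)
        for row in reader:
            counts[row[0]] = [int(value) for value in row[1:]]
    return counts


def read_all_counts(directory, pattern="*.tsv"):
    """Apply read_counts to every matching file, keyed by file name stem."""
    paths = sorted(Path(directory).glob(pattern))
    return {path.stem: read_counts(path) for path in paths}
```

You would call it as read_all_counts("my_geo_download"), where the directory name is a placeholder for wherever you saved the count matrices. The dictionary comprehension plays the role that purrr::map() plays in R.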

Another practical single-cell analysis example: you use simpleaf quant to quantify a sample from fastq to counts.

The command looks like this:

simpleaf quant --reads1 sample_R1.fq --reads2 sample_R2.fq (omitting other arguments)

Now, you can loop over many samples using shell commands. However, real-life bioinformatics is more complicated.

You usually have more than one fq file for the same sample, and you have 20 samples to deal with!

Moreover, the fastq files for the same sample need to be separated by commas:

simpleaf quant --reads1 sample_L1_R1.fq,sample_L2_R1.fq,sample_L3_R1.fq --reads2 sample_L1_R2.fq,sample_L2_R2.fq,sample_L3_R2.fq

How do you automate that?

Read this blog post in which I walk you through a real-life problem for single-cell RNAseq quantification step by step.
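
To give you a flavor of the idea, here is one possible sketch in Python. It is not necessarily the approach taken in the blog post. It assumes the files are named like the example above (sample name, then _L and a lane number, then _R1 or _R2, ending in .fq), and it only builds the command strings with the two arguments shown above, leaving out the other arguments. The helper names group_fastqs and build_commands are made up for this example.

```python
import re
from collections import defaultdict

# Assumed naming convention, taken from the example command:
# <sample>_L<lane>_R<1 or 2>.fq
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_L(?P<lane>\d+)_R(?P<read>[12])\.fq$")


def group_fastqs(filenames):
    """Group fastq file names by sample and by read (R1 or R2).

    Names are processed in sorted order, so the R1 and R2 lists of a
    sample end up in the same lane order.
    """
    grouped = defaultdict(lambda: {"1": [], "2": []})
    for name in sorted(filenames):
        match = FASTQ_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        grouped[match["sample"]][match["read"]].append(name)
    return grouped


def build_commands(filenames):
    """Return one simpleaf quant command string per sample."""
    commands = []
    for sample, reads in group_fastqs(filenames).items():
        # Other simpleaf arguments are omitted here, as in the text above.
        commands.append(
            "simpleaf quant "
            f"--reads1 {','.join(reads['1'])} "
            f"--reads2 {','.join(reads['2'])}"
        )
    return commands


files = [
    "sample_L1_R1.fq", "sample_L1_R2.fq",
    "sample_L2_R1.fq", "sample_L2_R2.fq",
    "sample_L3_R1.fq", "sample_L3_R2.fq",
]
for command in build_commands(files):
    print(command)
```

With 20 samples you would pass in the full list of file names (for example from os.listdir) and get 20 command strings back, one per sample.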

Conclusions

When you copy and paste the same code more than 3 times, it is time to write a function.
You can apply the same function to many inputs in Python and R using for loops, but at least in R, I use purrr to ditch my for loops.
Remember, computers are good at repetitive things.

Today's topic is a little more advanced. You may or may not understand the examples with detailed code. That's FINE!

At least you get the idea of why programs are useful and why we use computers to do repetitive work.

As you grow more experienced, you will understand it and apply those skills to real-world bioinformatics problems.

Happy Learning!

Tommy aka Crazyhottommy

PS:

If you want to learn Bioinformatics, there are four ways that I can help:

My free YouTube Chatomics channel. Make sure you subscribe to it. In this new video, I show you how to liftover an hg19 bedpe file to hg38.
I have many resources collected on my GitHub here.
I have been writing blog posts for over 10 years at https://divingintogeneticsandgenomics.com/
Lastly, I have a book called "From Cell Line to Command Line". The 40% off is ending in a couple of hours.

Stay awesome!

Hi! I'm Tommy Tang