
Hi! I'm Tommy Tang

The most neglected skill: organizing your computational biology project


Hi Bioinformatics lovers,

Welcome, all the new subscribers! Every Saturday you should expect to receive this short newsletter from me.

Today, we will cover one important topic that is rarely taught: how to organize your computational biology project. If you prefer video, you can watch it here.

I highly recommend you read this paper: A Quick Guide to Organizing Computational Biology Projects.

For every project, you should have a consistent folder structure. You do not have to use exactly the same structure as I do. I usually have a data folder, a scripts folder, a results folder and a doc folder.
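Here is a minimal sketch of setting up that layout from the command line. The project name my_project is a placeholder I made up for illustration; use whatever name and folders fit your own habits.

```shell
# Hypothetical project name; change it to whatever fits your project.
project="my_project"

# One folder per role: raw data, code, outputs, and documentation.
mkdir -p "$project/data" "$project/scripts" "$project/results" "$project/doc"

# List the folders that were created.
ls "$project"
```

Running the same command for every new project is an easy way to keep the structure consistent.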

The data in the data folder should be immutable: run chmod -R u-w data/ to remove your own write permission on everything in it.
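To see what this does, here is a small sketch you can try in a scratch directory. The folder demo_readonly and the file counts.tsv are made-up examples, not real data.

```shell
# Hypothetical layout: a project folder with one raw data file in data/.
project="demo_readonly"
mkdir -p "$project/data"
printf 'gene\tcount\nTP53\t10\n' > "$project/data/counts.tsv"

# Remove your own write permission from data/ and everything inside it.
# Putting -R before the mode works with both GNU and BSD/macOS chmod.
chmod -R u-w "$project/data"

# The owner permission bits should no longer contain a "w".
ls -ld "$project/data" "$project/data/counts.tsv"

# If you ever need to add new raw data, restore write access first:
# chmod -R u+w "$project/data"
```

After this, an accidental overwrite or delete inside data/ fails with a permission error instead of silently changing your raw data (unless you are running as root).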

Write your scripts in the scripts folder. The scripts should read the data from the data folder and write the results (figures, intermediate tables) to the results folder.
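As a rough sketch of that flow, the example below builds a toy project, saves a small analysis script in scripts/, and runs it. The folder demo_pipeline, the file counts.tsv, the gene counts, and the script name 01_total_counts.sh are all invented for illustration.

```shell
# Hypothetical example: a tiny input table and a script that summarizes it.
project="demo_pipeline"
mkdir -p "$project/data" "$project/scripts" "$project/results"
printf 'gene\tcount\nTP53\t10\nMYC\t25\nEGFR\t7\n' > "$project/data/counts.tsv"

# Save the analysis as a script in scripts/ so it can be rerun and version controlled.
cat > "$project/scripts/01_total_counts.sh" <<'EOF'
#!/bin/sh
# Read from data/ (never modify it) and write to results/.
# Run this from the project root.
awk -F '\t' 'NR > 1 { total += $2 } END { print "total_count\t" total }' \
    data/counts.tsv > results/total_counts.tsv
EOF

# Run the script from the project root and look at the output.
(cd "$project" && sh scripts/01_total_counts.sh)
cat "$project/results/total_counts.tsv"
```

Because the script only reads from data/ and only writes to results/, you can delete the whole results folder and regenerate it at any time.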

Version control your scripts with git and keep a copy on GitHub. For each project, document everything (have a README file in each folder): how you downloaded the data, the purpose of the project and its scripts, and how you manually edited a file if that was unavoidable (always prefer writing scripts to change files).
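One possible way to get started is sketched below. The folder demo_docs, the README.md file name, and the template headings are my own assumptions for illustration; write whatever notes make sense for your project.

```shell
# Hypothetical project name and README contents; adapt both to your project.
project="demo_docs"
mkdir -p "$project/data" "$project/scripts"

# Record where the data came from, right next to the data.
cat > "$project/data/README.md" <<'EOF'
# data

- Source: TODO (URL or accession the files were downloaded from)
- Downloaded on: TODO (date, plus the command or script used)
- Manual edits: TODO (none, or what was changed by hand and why)
EOF

# Record what the project and each script are for.
cat > "$project/scripts/README.md" <<'EOF'
# scripts

- Purpose of the project: TODO (one or two sentences)
- What each script does and the order to run them in: TODO
EOF

# Put the scripts and notes under version control (skipped if git is not installed).
if command -v git >/dev/null 2>&1; then
    git init -q "$project"
    git -C "$project" add scripts data/README.md
    # Then commit and push to your own GitHub repository, for example:
    # git -C "$project" commit -m "Add project notes"
fi
```

Only the scripts and notes are added to git here; large raw data files usually do not belong in the repository.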

Remember, it will help the future you! You will forget many details, but if you document the project well, you can pick it up again even after a couple of years. That effort pays off in the long term.

Other resources:

  1. Use random forest and boost trees to find marker genes in scRNAseq data https://divingintogeneticsandgenomics.com/post/use-random-forest-and-boost-trees-to-find-marker-genes-in-scrnaseq-data/
  2. A review of computational strategies for denoising and imputation of single-cell transcriptomic data https://academic.oup.com/bib/article/22/4/bbaa222/5916940?login=false
  3. There is this old interview in which I shared how I learned bioinformatics.
  4. If you want to hone your bioinformatics skills, reproducing a figure is the way to go. Read this: Introducing Figure One Lab (F1L) https://www.linkedin.com/pulse/introducing-figure-one-lab-f1l-dean-lee-u0wff/?trackingId=2AtVLY%2FhLqtg41HnbllB2Q%3D%3D
  5. Tidymodels Machine Learning: Diabetes Classification https://boiled-data.github.io/ClassificationDiabetes.html
  6. A blueprint for tumor-infiltrating B cells across human cancers | Science https://www.science.org/doi/10.1126/science.adj4857
  7. demuxSNP: supervised demultiplexing scRNAseq using cell hashing and SNPs https://www.biorxiv.org/content/10.1101/2024.04.22.590526v1
  8. Cell2TCR is a tool for inference of T cell receptor (TCR) motifs. A TCR motif describes a group of TCRs with sufficient sequence similarity to likely recognise a common epitope. https://github.com/Teichlab/cell2tcr

The best way to learn is to teach, because to teach something I first need to understand it at a deeper level.

Happy Learning!

Tommy

Let's connect on Twitter and LinkedIn!

I am a bioinformatician/computational biologist with six years of wet lab experience and over 12 years of computational experience. I will help you learn computational skills to tame astronomical data and derive insights. Check out the resources I offer below and sign up for my newsletter!
