profile

Chatomics! — The Bioinformatics Newsletter

New post from chatomics! How to create a GenomicRanges object in Bioconductor using canonical transcripts


Hello Bioinformatics lovers,

I enabled my blog https://divingintogeneticsandgenomics.com/#posts RSS feed.

So whenever I have a new blog post, it will be sent to your email.

This is different from my weekly Saturday newsletter.

My blog posts are mostly technical tutorials.

This is new update from my blog:

How to create a GenomicRanges object in Bioconductor using canonical transcripts

Published on July 15, 2025

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.

Introduction to Annotation Data Packages in Bioconductor

Accurate gene and transcript annotation is the foundation of many bioinformatics workflows, including RNA-seq analysis, functional genomics, and variant annotation.

In the R/Bioconductor ecosystem, dedicated annotation data packages make it easy for researchers to access, query, and leverage gene models sourced from major biological databases.

Understanding the origin and structure of these annotation packages—such as those based on UCSC, Ensembl, and other reference sets—is essential for reproducibility and clarity in your analyses.

Key Annotation Databases and Their Packages

  • TxDb.Hsapiens.UCSC.hg38.knownGene
    Built using gene models from the UCSC Genome Browser’s “knownGene” track for the human hg38 assembly. UCSC integrates gene predictions from several sources, including RefSeq, GenBank, and Ensembl, but curates its own transcript IDs and structures.

  • EnsDb Data Packages
    Based on Ensembl, a comprehensive genome annotation resource. EnsDb packages (created via the ensembldb package) are derived directly from Ensembl’s annotation of genes, transcripts, and proteins for specific genome builds. They reflect Ensembl’s standardized data model and genomic coordinates.

Major Transcript Reference Systems

Different annotation resources define transcripts differently, and “canonical” transcripts can refer to the preferred or most representative isoforms per gene. Here are some prominent systems:

Reference System Description Canonical Selection
UCSC (knownGene) Aggregates gene predictions, uses its own transcript IDs; many overlaps with RefSeq/Ensembl UCSC-specific rules
Ensembl Own annotation pipeline; covers standard and alternative transcripts, updated regularly Ensembl canonical
RefSeq Curated by NCBI; focuses on high-confidence, experimentally validated transcripts RefSeq Select
MANE “Matched Annotation from NCBI and EMBL-EBI”; a collaboration to produce a single transcript per protein-coding gene, identical in both RefSeq and Ensembl MANE Select

Additional Notes

  • RefSeq annotations are highly curated, with a focus on stability and biological accuracy.
  • MANE transcripts provide a harmonized “best” transcript for each gene, facilitating cross-platform comparisons and clinical reporting.
  • Canonical transcripts in Ensembl and UCSC are assigned based on expression, conservation, clinical relevance, or other attributes; the computational definitions may differ.

Why These Differences Matter

Selecting the appropriate annotation resource in your analysis affects:

  • Transcript/gene definitions: Variations in exon boundaries, IDs, and number of isoforms.
  • Reproducibility: Results may change depending on annotation source/version.
  • Interpretation: Clinical applications and downstream analyses (e.g., variant annotation) can be impacted by transcript choice.

In practice, choosing the right isoform of the mRNA can matter a lot for annotating your variants, the protein amino acid changes can be different based on the isoforms that are selected.

When annotating peaks for ChIP-seq data. Peaks have different distances among different isoforms.

Create a TxDb object using canonical transcripts only

library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(AnnotationHub)
library(ensembldb)
# load hg38 transcript
# directly filter canonical transcripts from the Ensembl database
ah<- AnnotationHub(localHub = FALSE)
edb<- ah[["AH98047"]]
edb
#> EnsDb for Ensembl:
#> |Backend: SQLite
#> |Db type: EnsDb
#> |Type of Gene ID: Ensembl Gene ID
#> |Supporting package: ensembldb
#> |Db created by: ensembldb package from Bioconductor
#> |script_version: 0.3.7
#> |Creation time: Sat Dec 18 14:48:15 2021
#> |ensembl_version: 105
#> |ensembl_host: localhost
#> |Organism: Homo sapiens
#> |taxonomy_id: 9606
#> |genome_build: GRCh38
#> |DBSCHEMAVERSION: 2.2
#> | No. of genes: 69329.
#> | No. of transcripts: 268255.
#> |Protein data available.
transcripts(edb)
#> GRanges object with 268255 ranges and 11 metadata columns:
#>                   seqnames            ranges strand |           tx_id
#>                      

There is a metadata column called tx_is_canonical to show if the transcript is canonical or not.

# Get transcripts directly from the EnsDb object with canonical filter
hg38_transcripts<- transcripts(edb, filter = ~ tx_is_canonical == TRUE)
# only the canonical transcripts are retained. tx_is_canonical are all 1s
hg38_transcripts
#> GRanges object with 69329 ranges and 11 metadata columns:
#>                   seqnames            ranges strand |           tx_id
#>                      

There are unconventional chromosome names such as LRG_239,

seqnames(hg38_transcripts)
#> factor-Rle of length 69329 with 456 runs
#>   Lengths:                       5702 ...                        522
#>   Values : 1                          ...         Y                 
#> Levels(456): 1 10 11 12 13 14 15 16 ... LRG_763 LRG_792 LRG_793 LRG_93 MT X Y
# only keep the standard names
hg38_transcripts<- hg38_transcripts %>%
  keepStandardChromosomes(pruning.mode = "coarse")
seqnames(hg38_transcripts)
#> factor-Rle of length 62800 with 25 runs
#>   Lengths: 5702 2429 3488 3152 1448 2358 2295 ... 3132 2554 2417   37 2522  522
#>   Values :   1    10   11   12   13   14   15 ...   7    8    9    MT   X    Y 
#> Levels(25): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 MT X Y

rename chromosome from 1,2,3 to chr1, chr2, chr3 etc so it can overlap with the peaks if your peaks are in chr1 start end format.

seqlevels(hg38_transcripts)
#>  [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2"  "20" "21" "22"
#> [16] "3"  "4"  "5"  "6"  "7"  "8"  "9"  "MT" "X"  "Y"
# prefix chr
new_seqnames <- paste0("chr", seqlevels(hg38_transcripts))
names(new_seqnames) <- seqlevels(hg38_transcripts)
new_seqnames
#>       1      10      11      12      13      14      15      16      17      18 
#>  "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" 
#>      19       2      20      21      22       3       4       5       6       7 
#> "chr19"  "chr2" "chr20" "chr21" "chr22"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7" 
#>       8       9      MT       X       Y 
#>  "chr8"  "chr9" "chrMT"  "chrX"  "chrY"
hg38_transcripts <- renameSeqlevels(hg38_transcripts, new_seqnames)
# add the gene symbol using database mapping for better accuracy
hg38_transcripts$SYMBOL <- mapIds(edb,
                                 keys = hg38_transcripts$tx_id,
                                 column = "SYMBOL",
                                 keytype = "TXID")
hg38_transcripts
#> GRanges object with 62800 ranges and 12 metadata columns:
#>                   seqnames            ranges strand |           tx_id
#>                      

Now we have a GenomicRanges object with only the canonical transcripts!! You can then use it to annotate your ChIP-seq peaks for example.

In my next blog post, I am going to show you how to calculate regulatory potential score using ChIP-seq peaks and this canonical transcripts annotation.

Let’s take a look at the UCSC annotation.

txdb<- TxDb.Hsapiens.UCSC.hg38.knownGene
UCSC_transcripts<- transcripts(txdb)
UCSC_transcripts
#> GRanges object with 276905 ranges and 2 metadata columns:
#>                       seqnames        ranges strand |     tx_id
#>                          
head(seqlevels(UCSC_transcripts), n = 30)
#>  [1] "chr1"                "chr2"                "chr3"               
#>  [4] "chr4"                "chr5"                "chr6"               
#>  [7] "chr7"                "chr8"                "chr9"               
#> [10] "chr10"               "chr11"               "chr12"              
#> [13] "chr13"               "chr14"               "chr15"              
#> [16] "chr16"               "chr17"               "chr18"              
#> [19] "chr19"               "chr20"               "chr21"              
#> [22] "chr22"               "chrX"                "chrY"               
#> [25] "chrM"                "chr1_GL383518v1_alt" "chr1_GL383519v1_alt"
#> [28] "chr1_GL383520v2_alt" "chr1_KI270759v1_alt" "chr1_KI270760v1_alt"

we see strange chromosome names such as chr1_GL383518v1_alt.

The human reference assembly includes not just the main chromosomes (e.g., chr1, chr2…) but also alternate loci and fix patches.

“alt” contigs cover areas of the genome where population-level structural variation or highly polymorphic regions cannot be represented by a single, linear reference.

Each name describes: The primary chromosome (chr1)

The unique contig identifier (GL383518v1, KI270759v1, etc.; these are GenBank accession numbers with version)

The “_alt” suffix, indicating an alternate locus or haplotype scaffold.

Also, this transcripts annotation does not have gene symbols and you may need to map it using org.Hs.eg.db package.

get all the genes representation

genes(txdb)
#> GRanges object with 30733 ranges and 1 metadata column:
#>             seqnames              ranges strand |     gene_id
#>                

Note this function collapse all the transcripts from the same gene into a single GenomicRanges. This may or may not be what you want. It effectively getting the longest isoform of that gene.

Conclusion

When building workflows in Bioconductor, it’s crucial to know the origins and intentions behind annotation data packages. Whether you rely on UCSC, Ensembl, RefSeq, or MANE, your chosen dataset will shape how genes and transcripts are referenced throughout your research. Always specify your annotation sources for transparent, reproducible results.

Read more...

Happy Learning!

Tommy aka. crazyhottommy

Chatomics! — The Bioinformatics Newsletter

Why Subscribe?✅ Curated by Tommy Tang, a Director of Bioinformatics with 100K+ followers across LinkedIn, X, and YouTube✅ No fluff—just deep insights and working code examples✅ Trusted by grad students, postdocs, and biotech professionals✅ 100% free

Share this page