Moving between species in R

It’s often useful to compare data against a published dataset from another species. These are the most common tasks I complete for this purpose (and the corresponding libraries in R). Unfortunately converting between species always seems to introduce missing identifiers and so I have tried to choose the method which avoids this as much as possible.

Mapping identifiers to their Ensembl stable ID (using EnsemblDB)
- UCSC to Ensembl genome builds.
- Example with mouse
Find orthologs of genes of interest (using bioMart ) .
- Fetching chromosome coordinates corresponding to specific ensembl gene build
retrieve build specific information about the genome itself (using BSGenome).
- fetching chromosome sizes

Mapping identifiers to their Ensembl stable ID (using EnsemblDB)

I believe the most reliable and stable idenfification for specific genes is their Ensembl ID so I always convert to ensemblid as intermediary when moving between species, and then use the relevant [EnsDb] package to convert between annotation types. This helps avoid problems with ids that are deprecated, missing, alias, mogrified by excel etc. However even with this, it still seems inevitable to lose a few genes every time conversion happens! Sad.
species specific gene and protein naming conventions can be found in wikipedia’s gene nomenclature page. In the vertebrate gene and protein symbol conventions subsection, it’s possible to see that the symbols for the gene and protein are the same for species like mouse, just that the gene symbol is italicized. Therefore to obtain the protein symbol associated with some Ensembl ID, it may be more reliable to directly map to symbol from the gene id rather than from the protein id which sometimes is missing.
You can find some alternative ID conversion approaches e.g. on Biostars most of which actually seem a bit more difficult to me.

UCSC to Ensembl genome builds.

Each genome assembly has corresponding UCSC and Ensembl (and other!) identifiers . For example UCSC genome assemblies are named as hg19, hg38 etc. (for human) while ensembl/NCBI genome assemblies have names such as GRCh38.p12. You can find what the correspondence on the UCSC Assembly releases and versions page.

Example with mouse

EnsDb comes with the function mapIds(x, keys, column, keytype, …, multiVals) for mapping from a vector to some annotation type, which is a lot like the dplyr mapvalues function for example.

# the keytypes of an EnsDb package represent annotations that you can map _from_
> keytypes(EnsDb.Mmusculus.v79)
 [1] "ENTREZID"            "EXONID"              "GENEBIOTYPE"        
 [4] "GENEID"              "GENENAME"            "PROTDOMID"          
 [7] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
[10] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
[13] "TXBIOTYPE"           "TXID"                "TXNAME"             
[16] "UNIPROTID"          

# the 'column' types of an EnsDb package represent annotations that you can map _to_
> columns(EnsDb.Mmusculus.v79)
 [1] "ENTREZID"            "EXONID"              "EXONIDX"            
 [4] "EXONSEQEND"          "EXONSEQSTART"        "GENEBIOTYPE"        
 [7] "GENEID"              "GENENAME"            "GENESEQEND"         
[10] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"         
[13] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"    
[16] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"    
[19] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"            
[22] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"          
[25] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"               
[28] "TXNAME"              "TXSEQEND"            "TXSEQSTART"         
[31] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE"

# test data frame containing mouse genes, rat genes and rat protein id
# some genes don't have a corresponding protein id even if they are coding genes!
> head(mm_genes)
      Gene.stable.ID   Gene.stable.ID.1  Protein.stable.ID
1 ENSRNOG00000048485 ENSMUSG00000053765 ENSMUSP00000056993
2 ENSRNOG00000001369 ENSMUSG00000066861                   
3 ENSRNOG00000057769 ENSMUSG00000001168 ENSMUSP00000072297
4 ENSRNOG00000048485 ENSMUSG00000032623 ENSMUSP00000048054
5 ENSRNOG00000048222 ENSMUSG00000074151                   
6 ENSRNOG00000037744 ENSMUSG00000052776 ENSMUSP00000079198

# now map the list of genes from mouse to symbol.
# note the 'keys' are the input and the 'keytype' will give the matches 
# in the case of mouse, protein symbol and gene symbol actually will return 
# the same thing due to nomenclature (except for missing identifiers)
mm_genes$protein_symbol <- mapIds(EnsDb.Mmusculus.v79, as.character(mm_genes$Protein.stable.ID), "PROTEINID", keytype="GENEID" )
mm_genes$gene_symbol <- mapIds(EnsDb.Mmusculus.v79, as.character(mm_genes$Protein.stable.ID), "SYMBOL", keytype="GENEID" )

> head(mm_genes)
   Gene.stable.ID   Gene.stable.ID.1  Protein.stable.ID protein_symbol gene_symbol
1 ENSG00000089127 ENSMUSG00000053765 ENSMUSP00000056993          Oas1f       Oas1f
2 ENSG00000089127 ENSMUSG00000001168 ENSMUSP00000072297          Oas1h       Oas1h
3 ENSG00000089127 ENSMUSG00000066861                              <NA>       Oas1g
4 ENSG00000089127 ENSMUSG00000052776 ENSMUSP00000079198          Oas1a       Oas1a
5 ENSG00000089127 ENSMUSG00000032623 ENSMUSP00000048054          Oas1d       Oas1d
6 ENSG00000132109 ENSMUSG00000030966 ENSMUSP00000033264         Trim21      Trim21

Find orthologs of genes of interest (using bioMart ) .

Even between closely related species such as human and mouse, the gene names for orthologs can be quite different or missing e.g. as found when querying MGI for the mouse genes. Compared to doing this by hand, bioMart seems quicker but is also up to date and works with Ensembl annotations. There is already really good resources on using bioMart such as on Dave Tang’s blog or in the vignette.

Fetching chromosome coordinates corresponding to specific ensembl gene build

if using biomart’s useMart function without the ‘host’ argument, the default coordinates fetched seems to be from the most recent ensembl genome build.
you can list the relevant host names as suggested here in section 5 of ‘Using archived versions of Ensembl’
for example, a short script to obtain a data frame containing attributes of interest for orthologs between mouse and human genes:

# map from mouse to human
# see https://www.biostars.org/p/136775/
require(biomaRt)
human = useMart("ensembl", dataset = "hsapiens_gene_ensembl", 
                host="feb2014.archive.ensembl.org")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl",
                host="feb2014.archive.ensembl.org")


searchAttributes(mart = human, pattern = "hgnc")

# get all the genes that map from one species to another
# the arguments 'attributes, mart' correspond to the 'from' species
# the arguments 'attributesL, martL' correspond to the 'to' species

genes = getLDS(attributes = c("ensembl_gene_id"), 
                 mart = mouse, 
                 attributesL = c("ensembl_gene_id",
                                 "chromosome_name",
                                 "start_position",
                                 "end_position"),
                 martL = human, 
                 uniqueRows=T)

# as you can see from the output below, very few genes out of all genes for a given species are annotated as orthologs.
> length(genes$Gene.stable.ID)
[1] 85957
> length(unique(genes$Gene.stable.ID))
[1] 19927
> length(unique(genes$Gene.stable.ID.1))
[1] 20636

retrieve build specific information about the genome itself (using BSGenome).

fetching chromosome sizes

BSgenome can retrieve build-specific information such as chromosome lengths, as e.g. given below.
However you do have to install a separate package for each genome of interest e.g. hg19.

library(BSgenome.Hsapiens.UCSC.hg19)

head(names(seqlengths(Hsapiens)))
seqlengths(Hsapiens)

Written on August 25, 2019