Moving between species in R
It’s often useful to compare data against a published dataset from another species. These are the tasks I most often need for this purpose (and the corresponding R packages). Unfortunately, converting between species always seems to lose some identifiers, so I have tried to choose the methods that avoid this as much as possible.
- Mapping identifiers to their Ensembl stable ID (using ensembldb).
- Find orthologs of genes of interest (using biomaRt).
- Retrieve build-specific information about the genome itself (using BSgenome).
Mapping identifiers to their Ensembl stable ID (using ensembldb)
- I believe the most reliable and stable identification for specific genes is their Ensembl ID, so I always convert to Ensembl IDs as an intermediary when moving between species, and then use the relevant EnsDb package to convert between annotation types. This helps avoid problems with IDs that are deprecated, missing, aliases, mogrified by Excel, etc. However, even with this, it still seems inevitable to lose a few genes every time a conversion happens! Sad.
- Species-specific gene and protein naming conventions can be found on Wikipedia’s gene nomenclature page. The vertebrate gene and protein symbol conventions subsection shows that, for species like mouse, the gene and protein symbols are the same apart from formatting (the gene symbol is italicized, and capitalization may differ). Therefore, to obtain the protein symbol associated with some Ensembl ID, it may be more reliable to map directly from the gene ID to the symbol rather than from the protein ID, which is sometimes missing.
- You can find some alternative ID conversion approaches, e.g. on Biostars, most of which actually seem a bit more difficult to me.
UCSC to Ensembl genome builds.
Each genome assembly has corresponding UCSC and Ensembl (and other!) identifiers. For example, UCSC genome assemblies are named hg19, hg38, etc. (for human), while Ensembl/NCBI genome assemblies have names such as GRCh38.p12. You can find the correspondence on the UCSC Assembly releases and versions page.
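For the handful of builds I run into most, a small hand-written lookup is often enough. This is only a sketch: the `ucsc_to_grc` vector is my own illustrative table (not something from a package), and it deliberately ignores patch suffixes such as `.p12`.

```r
# Hand-written lookup from UCSC assembly names to the matching GRC major
# assembly name. Illustrative only; patch levels (e.g. ".p12") are omitted.
ucsc_to_grc <- c(
  hg19 = "GRCh37",
  hg38 = "GRCh38",
  mm10 = "GRCm38",
  mm39 = "GRCm39"
)

# UCSC name -> Ensembl/NCBI name
ucsc_to_grc[["hg38"]]

# Reverse lookup: Ensembl/NCBI name -> UCSC name
grc_to_ucsc <- setNames(names(ucsc_to_grc), ucsc_to_grc)
grc_to_ucsc[["GRCm38"]]
```

For anything outside this short list, the UCSC page remains the place to check.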
Example with mouse
- EnsDb objects work with the function mapIds(x, keys, column, keytype, ..., multiVals) for mapping a vector of identifiers to some annotation type, which is a lot like the plyr mapvalues function, for example.
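A minimal sketch of such a mapping for mouse, assuming the Bioconductor packages ensembldb and EnsDb.Mmusculus.v79 are installed (the choice of the v79 annotation package and the three gene symbols are my own illustrative assumptions):

```r
# Assumes ensembldb and EnsDb.Mmusculus.v79 have been installed from Bioconductor
library(ensembldb)
library(EnsDb.Mmusculus.v79)

edb <- EnsDb.Mmusculus.v79

# Illustrative input: a few mouse gene symbols
symbols <- c("Trp53", "Actb", "Gapdh")

# Symbol -> Ensembl gene ID; multiVals = "first" keeps one ID per symbol
ens_ids <- mapIds(edb, keys = symbols, column = "GENEID",
                  keytype = "SYMBOL", multiVals = "first")

# Ensembl gene ID -> symbol, going straight from the gene ID
# (IDs that failed to map in the previous step are dropped first)
valid_ids <- unname(ens_ids[!is.na(ens_ids)])
back_to_symbol <- mapIds(edb, keys = valid_ids, column = "SYMBOL",
                         keytype = "GENEID", multiVals = "first")

# Keep track of what was lost in the conversion
unmapped <- symbols[is.na(ens_ids[symbols])]
```

Checking `unmapped` after every conversion makes it easier to see exactly which genes went missing.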
Find orthologs of genes of interest (using biomaRt)
Even between closely related species such as human and mouse, the gene names for orthologs can be quite different or missing, e.g. as found when querying MGI for the mouse genes. Compared to doing this by hand, biomaRt is quicker, and it is also up to date and works with Ensembl annotations. There are already really good resources on using biomaRt, such as Dave Tang’s blog or the vignette.
Fetching chromosome coordinates corresponding to a specific Ensembl gene build
- If biomaRt’s useMart function is called without the ‘host’ argument, the coordinates fetched by default seem to come from the most recent Ensembl genome build.
- You can list the relevant host names as suggested here in section 5 of ‘Using archived versions of Ensembl’.
- For example, a short script to obtain a data frame containing attributes of interest for orthologs between mouse and human genes:
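A hedged sketch of such a script, assuming biomaRt is installed and the Ensembl servers are reachable. The two Ensembl gene IDs are illustrative inputs of my own, and the attribute names are the ones I believe the mouse dataset exposes for human homologs; `listAttributes(mouse)` will confirm what is available on the release you connect to.

```r
library(biomaRt)

# List the archived Ensembl releases and their host URLs.
# To pin a specific build, pass one of these URLs as the 'host' argument below.
archives <- listEnsemblArchives()
head(archives)

# Connect to the mouse gene dataset (no 'host' given, so the current release is used)
mouse <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")

# Illustrative input: Ensembl gene IDs for two mouse genes
mouse_ids <- c("ENSMUSG00000059552", "ENSMUSG00000029580")

# Fetch the human ortholog and its coordinates for each mouse gene
orthologs <- getBM(
  attributes = c("ensembl_gene_id",
                 "external_gene_name",
                 "hsapiens_homolog_ensembl_gene",
                 "hsapiens_homolog_associated_gene_name",
                 "hsapiens_homolog_chromosome",
                 "hsapiens_homolog_chrom_start",
                 "hsapiens_homolog_chrom_end"),
  filters = "ensembl_gene_id",
  values = mouse_ids,
  mart = mouse
)

# Mouse genes that came back without any human ortholog
missing <- setdiff(mouse_ids,
                   orthologs$ensembl_gene_id[orthologs$hsapiens_homolog_ensembl_gene != ""])
```

The result is an ordinary data frame, so it can be merged back onto the original data by `ensembl_gene_id`.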
Retrieve build-specific information about the genome itself (using BSgenome)
Fetching chromosome sizes
- BSgenome can retrieve build-specific information such as chromosome lengths.
- However, you do have to install a separate package for each genome of interest, e.g. hg19.
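As a rough sketch of fetching chromosome sizes, assuming the Bioconductor package BSgenome.Hsapiens.UCSC.hg19 is installed (swap in the package for whichever build you need):

```r
# Assumes BSgenome.Hsapiens.UCSC.hg19 has been installed from Bioconductor;
# loading it also attaches BSgenome and GenomeInfoDb
library(BSgenome.Hsapiens.UCSC.hg19)

genome <- BSgenome.Hsapiens.UCSC.hg19

# Named integer vector of sequence lengths for every sequence in the build
chrom_sizes <- seqlengths(genome)

# Keep only the primary chromosomes (UCSC-style names)
primary <- paste0("chr", c(1:22, "X", "Y"))
chrom_sizes[primary]
```

Because the names follow the UCSC style ("chr1", not "1"), they may need converting before being combined with Ensembl-style coordinates.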