GMT file import in R

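As outlined in the GSEA wiki, GMT files store gene sets, such as functional pathways, with one set per line. As the figure there makes clear, the fields on each line are separated by tabs: first the set name, then a description, then the genes. Because the number of genes can differ from line to line, the file does not fit a regular rectangular table.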
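To make the format concrete, here is a minimal base R sketch that splits GMT-style lines by hand. The two example lines and the helper name `parse_gmt_line` are made up for illustration, and the layout assumed is the one described above (name, description, then genes, all tab-separated).

```r
# Two made-up GMT-style lines: name <tab> description <tab> gene1 <tab> gene2 ...
gmt_lines <- c(
  "PATHWAY_A\tfirst example pathway\tGene1\tGene2\tGene3",
  "PATHWAY_B\tsecond example pathway\tGene4"
)

# Hypothetical helper: split one line on tabs and separate the three parts
parse_gmt_line <- function(line) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  list(name = fields[1],
       description = fields[2],
       genes = fields[-(1:2)])
}

parsed <- lapply(gmt_lines, parse_gmt_line)

# Number of genes per line; the two lines differ in length
gene_counts <- lengths(lapply(parsed, `[[`, "genes"))
print(gene_counts)
```

The result is a list of lists rather than a table, which is the natural shape for lines of unequal length.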

To import such a file into R easily, it can be read with the GSA.read.gmt function from the GSA package. This returns a list with three elements: ‘genesets’, ‘geneset.names’ and ‘geneset.descriptions’.

Example with mouse GMT file

Here is an example with the February 2019 version of the GOBP pathway file from the Bader lab Enrichment Map Genesets directory. As described on their website, their gene set files are generated monthly and can be used with their Enrichment Map software. The code for generating these files appears to be on the lab's GitHub.

#install.packages("GSA")
library(GSA)
# download this from the Bader lab website at:
# 
# data_dir must be set beforehand to the folder holding the downloaded file
gmt_file <- GSA.read.gmt(file.path(data_dir,
                      "Supplementary_Table3_Mouse_GOBP_AllPathways_no_GO_iea_February_01_2019_symbol.gmt"))


> head(gmt_file$genesets)
[[1]]
[1] "Pycrl"    "Pycr2"    "Pycr1"    "Aldh18a1" ""        

[[2]]
[1] "Mocos" ""     

[[3]]
[1] "Gk2" "Gk5" "Gyk" ""     

> head(gmt_file$geneset.names)
[1] "PROLINE BIOSYNTHESIS I%HUMANCYC%PROSYN-PWY"             
[2] "THIO-MOLYBDENUM COFACTOR BIOSYNTHESIS%HUMANCYC%PWY-5963"
[3] "GLYCEROL DEGRADATION I%HUMANCYC%PWY-4261"               
[4] "MOLYBDENUM COFACTOR BIOSYNTHESIS%HUMANCYC%PWY-6823"

> head(gmt_file$geneset.descriptions)
[1] "proline biosynthesis I"                "thio-molybdenum cofactor biosynthesis"
[3] "glycerol degradation I"                "molybdenum cofactor biosynthesis"     
[5] "oxidative ethanol degradation III"     "tetrapyrrole biosynthesis II"
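
Each gene set in the output above ends with an empty string, presumably because the lines in this file end with a trailing tab (an assumption, not checked against the file itself). The following is a minimal sketch of one way to drop those empty entries and label each gene set with its name. The `gmt_file` object here is a small stand-in copied from the printed output, not the real import, and `gene_sets` is just an illustrative variable name.

```r
# Stand-in for the object returned by GSA.read.gmt(), copied from the output above
gmt_file <- list(
  genesets = list(c("Pycrl", "Pycr2", "Pycr1", "Aldh18a1", ""),
                  c("Mocos", ""),
                  c("Gk2", "Gk5", "Gyk", "")),
  geneset.names = c("PROLINE BIOSYNTHESIS I%HUMANCYC%PROSYN-PWY",
                    "THIO-MOLYBDENUM COFACTOR BIOSYNTHESIS%HUMANCYC%PWY-5963",
                    "GLYCEROL DEGRADATION I%HUMANCYC%PWY-4261")
)

# Drop empty strings from every gene set
gene_sets <- lapply(gmt_file$genesets, function(genes) genes[genes != ""])

# Label each gene set with its pathway name
names(gene_sets) <- gmt_file$geneset.names

# Number of genes remaining in each set
print(lengths(gene_sets))
```

With the real import, the same two lines should apply unchanged to the full `gmt_file` object, giving a named list that can be indexed by pathway name.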
Written on February 2, 2020