Working with granges
GRanges seems to be the most standard data structure to represent genomic coordinates in R, supported by Bioconductor’s GenomicRanges package. The vignette describes useful associated attributes and functions such as names for each region.
Note if you’re converting between them, bed files have 0-based indexing, while granges have 1-based indexing.
- Make a granges object
- Enumerating Overlaps
- Find the coverage over the genome of a bed file in base pairs
Make a granges object
- This can be done by assembling the data from lists with a call to GRanges() or from a data frame, as below.
- For snps, it seems like the coordinates as both start and end works fine.
ranges <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges=IRanges(1:10, width=10:1, names=head(letters,10)), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10) ) snps <- makeGRangesFromDataFrame(jansen_df)
- findOverlaps returns a granges of overlapping intervals, while countOverlaps returns the number of overlaps for each element. This means their outputs are usually different.
> length(GenomicRanges::intersect(snps, ranges, + ignore.strand = TRUE))  1167019 > length(findOverlaps(snps,ranges, ignore.strand = TRUE))  1186791 > length(countOverlaps(snps,ranges, ignore.strand = TRUE))  13367299
Find the coverage over the genome of a bed file in base pairs
- reduce() can be used to eliminate double counting of regions.
- width of a granges object seems to give number of base pairs from each entry combined.
> sum(width(ranges))  263751516 > View(snps) > sum(width(GenomicRanges::reduce(ranges)))  263608754