How can I find the overlapping values of these ranges in R?
I have a data frame (df1) called ranges, like:
1 bin chrom chromStart chromEnd name score
2 12 chr1 836780 856723 -5.7648 599
3 116 chr1 1693001 1739032 -4.8403 473
4 117 chr1 1750780 1880930 -5.3036 536
5 121 chr1 2020123 2108890 -4.4165 415
I also have a data.frame (df2) called viable, like:
chrom chromStart chromEnd N
chr1 840000 890000 1566
chr1 1690000 1740000 1566
chr1 1700000 1750000 1566
chr1 1710000 1760000 1566
chr1 1720000 1770000 1566
chr1 1730000 1780000 1566
chr1 1740000 1790000 1566
chr1 1750000 1800000 1566
chr1 1760000 1810000 1566
Basically, ranges holds ranges of values running from chromStart to chromEnd. The second data frame, viable, also holds a list of ranges, and those ranges are much smaller. I want to test the ranges in ranges and make sure that each full range falls within the ranges in viable. How can I do this?
The output I want is a data.frame like:
1 bin chrom chromStart chromEnd name score
2 12 chr1 840000 856723 -5.7648 599
3 116 chr1 1693001 1739032 -4.8403 473
6 133 chr1 1750780 1880930 -4.8096 469
You could try using the GenomicRanges package.
library(dplyr)
library(GenomicRanges)
Here we load in the example input data. (This is an inelegant way to do this, I know, but I was lazy and Sublime's multi-line editing made it easy.) Note: I don't know what the "1" column means, but I kept it in the data.
ranges <-
  rbind(
    c("2","12","chr1","836780","856723","-5.7648","599"),
    c("3","116","chr1","1693001","1739032","-4.8403","473"),
    c("4","117","chr1","1750780","1880930","-5.3036","536"),
    c("5","121","chr1","2020123","2108890","-4.4165","415")
  ) %>%
  as.data.frame()
colnames(ranges) <-
  c("1","bin","chrom","chromStart","chromEnd","name","score")
viable <-
  rbind(
    c("chr1","840000","890000","1566"),
    c("chr1","1690000","1740000","1566"),
    c("chr1","1700000","1750000","1566"),
    c("chr1","1710000","1760000","1566"),
    c("chr1","1720000","1770000","1566"),
    c("chr1","1730000","1780000","1566"),
    c("chr1","1740000","1790000","1566"),
    c("chr1","1750000","1800000","1566"),
    c("chr1","1760000","1810000","1566")
  ) %>%
  as.data.frame()
colnames(viable) <-
  c("chrom","chromStart","chromEnd","N")
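## The coordinate columns need to be integers.
## as.character() comes first so the conversion is also correct if the
## columns were read in as factors (the default before R 4.0).
## as_tibble() replaces tbl_df(), which is deprecated in current dplyr.
ranges <-
  ranges %>%
  as_tibble() %>%
  mutate(
    chromStart = chromStart %>% as.character %>% as.integer,
    chromEnd = chromEnd %>% as.character %>% as.integer
  )
viable <-
  viable %>%
  as_tibble() %>%
  mutate(
    chromStart = chromStart %>% as.character %>% as.integer,
    chromEnd = chromEnd %>% as.character %>% as.integer
  )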
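As an alternative to the rbind-then-convert construction above, the same example inputs can be built with typed columns from the start, which makes the integer conversion unnecessary. This is only a sketch in base R: the helper vector viable_start is an illustrative name that does not appear in the original, and the values are copied from the question.

```r
# Sketch: build the example inputs directly with integer coordinates.
# check.names = FALSE keeps the oddly named "1" column as-is.
ranges <- data.frame(
  `1`        = c(2L, 3L, 4L, 5L),
  bin        = c(12L, 116L, 117L, 121L),
  chrom      = "chr1",
  chromStart = c(836780L, 1693001L, 1750780L, 2020123L),
  chromEnd   = c(856723L, 1739032L, 1880930L, 2108890L),
  name       = c("-5.7648", "-4.8403", "-5.3036", "-4.4165"),
  score      = c(599L, 473L, 536L, 415L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every viable window in the example is exactly 50000 wide,
# so the end column can be derived from the start column.
viable_start <- c(840000L, 1690000L, 1700000L, 1710000L, 1720000L,
                  1730000L, 1740000L, 1750000L, 1760000L)
viable <- data.frame(
  chrom      = "chr1",
  chromStart = viable_start,
  chromEnd   = viable_start + 50000L,
  N          = 1566L,
  stringsAsFactors = FALSE
)
```

Unlike the original construction, bin, score and N are integers here rather than strings; the steps below only rely on chromStart and chromEnd being integers.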
My answer starts here:
- Reformat the data frames to the GenomicRanges class
- Find the regions by doing an intersection
- Add in the bin, name, and score columns using findOverlaps. (Note: this information is removed during the intersection because there is not necessarily a 1:1 mapping.)
- Reformat the output back into a data frame
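To make the intersection step concrete before the GenomicRanges code, here is a small base R sketch of what intersecting one range with a set of viable blocks computes. It is an illustration only, not how GenomicRanges is implemented: clip_to_viable is a hypothetical helper, it assumes everything is on a single chromosome, and the two viable blocks were merged by hand from the overlapping windows in the question.

```r
# Hypothetical helper (not part of GenomicRanges): clip the interval
# [start, end] against each interval in a set of viable blocks.
clip_to_viable <- function(start, end, v_start, v_end) {
  lo <- pmax(start, v_start)   # the later of the two starts
  hi <- pmin(end, v_end)       # the earlier of the two ends
  keep <- lo <= hi             # keep non-empty overlaps only
  data.frame(chromStart = lo[keep], chromEnd = hi[keep])
}

# Viable windows from the question, merged by hand into disjoint blocks
# (assumption: single chromosome, chr1).
v_start <- c(840000L, 1690000L)
v_end   <- c(890000L, 1810000L)

# First row of ranges: starts before the first viable block, so it is clipped.
clipped <- clip_to_viable(836780L, 856723L, v_start, v_end)
clipped
#   chromStart chromEnd
# 1     840000   856723
```

The result carries only coordinates, which is why the bin, name and score columns have to be re-attached in a separate step.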
gr.ranges <-
  makeGRangesFromDataFrame(ranges,
                           keep.extra.columns = TRUE,
                           seqnames.field = "chrom",
                           start.field = "chromStart",
                           end.field = "chromEnd")
gr.viable <-
  makeGRangesFromDataFrame(viable,
                           keep.extra.columns = TRUE,
                           seqnames.field = "chrom",
                           start.field = "chromStart",
                           end.field = "chromEnd")
# To find the intersects
gr.intersect <-
  GenomicRanges::intersect(gr.ranges, gr.viable)
# For linking up the non-coordinate columns (everything except chrom, start, end)
gr.hits <-
  GenomicRanges::findOverlaps(gr.intersect, gr.ranges)
output <-
  gr.intersect[queryHits(gr.hits)]
mcols(output) <-
  mcols(gr.ranges[subjectHits(gr.hits)])
output
# Reformat to dataframe
output %>%
  as.data.frame() %>%
  select(`1` = X1, bin, chrom = seqnames, chromStart = start, chromEnd = end, name, score)
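Note that the intersection clips a range to the viable regions rather than testing whether it fits entirely inside them. If the stricter reading of the question is wanted (keep a range only when the full range falls within viable), one possible sketch uses reduce() to merge the overlapping viable windows and subsetByOverlaps() with type = "within". The object names below (gr_ranges, gr_viable, viable_blocks, contained) are illustrative and the data is rebuilt inline so the snippet stands alone.

```r
library(GenomicRanges)

# Example coordinates from the question; only bin is kept as metadata here.
gr_ranges <- GRanges(
  seqnames = rep("chr1", 4),
  ranges = IRanges(
    start = c(836780, 1693001, 1750780, 2020123),
    end   = c(856723, 1739032, 1880930, 2108890)
  ),
  bin = c(12L, 116L, 117L, 121L)
)

viable_start <- c(840000, seq(1690000, 1760000, by = 10000))
gr_viable <- GRanges(
  seqnames = rep("chr1", length(viable_start)),
  ranges = IRanges(start = viable_start, end = viable_start + 50000)
)

# Merge overlapping viable windows into disjoint blocks:
# chr1:840000-890000 and chr1:1690000-1810000
viable_blocks <- reduce(gr_viable)

# Keep only the ranges that lie entirely inside one merged block
contained <- subsetByOverlaps(gr_ranges, viable_blocks, type = "within")
contained$bin
# [1] 116
```

With the example data only bin 116 passes this test: bin 12 starts before the first viable block and bin 117 ends after the second one, so both are merely clipped by the intersection approach above.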