Question:

I have a data frame (df1) called ranges like:

1    bin chrom chromStart  chromEnd    name score
2     12  chr1   836780    856723    -5.7648   599
3    116  chr1   1693001   1739032   -4.8403   473
4    117  chr1   1750780   1880930   -5.3036   536
5    121  chr1   2020123   2108890   -4.4165   415

I also have a data.frame called viable like:

   chrom   chromStart  chromEnd        N
chr1      840000       890000       1566
chr1      1690000      1740000      1566
chr1      1700000      1750000      1566
chr1      1710000      1760000      1566
chr1      1720000      1770000      1566
chr1      1730000      1780000      1566
chr1      1740000      1790000      1566
chr1      1750000      1800000      1566
chr1      1760000      1810000      1566

Basically, ranges holds intervals that run from chromStart to chromEnd. The second data frame (df2), viable, also holds a list of intervals. The ranges in viable are much smaller. I want to test each interval in ranges and make sure that the full interval falls within ranges that are viable. How can I do this?

The output I want is a data.frame like:

1    bin chrom chromStart  chromEnd    name score
2     12  chr1   840000    856723    -5.7648   599
3    116  chr1   1693001   1739032   -4.8403   473
6    133  chr1   1750780   1880930   -4.8096   469

Answer:

You could try using the GenomicRanges package.

library(dplyr)
library(GenomicRanges)

Here we load the example input data. (This is an inelegant way to do it, I know, but I was lazy and Sublime's multiline edit made it easy.) Note: I don't know what the "1" column means, but I kept it in the data.

ranges <-
  rbind(
    c("2","12","chr1","836780","856723","-5.7648","599"),
    c("3","116","chr1","1693001","1739032","-4.8403","473"),
    c("4","117","chr1","1750780","1880930","-5.3036","536"),
    c("5","121","chr1","2020123","2108890","-4.4165","415")
  ) %>% 
  as.data.frame()
colnames(ranges) <-
  c("1","bin","chrom","chromStart","chromEnd","name","score")

viable <-
  rbind(
    c("chr1","840000","890000","1566"),
    c("chr1","1690000","1740000","1566"),
    c("chr1","1700000","1750000","1566"),
    c("chr1","1710000","1760000","1566"),
    c("chr1","1720000","1770000","1566"),
    c("chr1","1730000","1780000","1566"),
    c("chr1","1740000","1790000","1566"),
    c("chr1","1750000","1800000","1566"),
    c("chr1","1760000","1810000","1566")
  ) %>%
  as.data.frame()
colnames(viable) <-
  c("chrom","chromStart","chromEnd","N")

## The coordinate columns need to be integers
## (as_tibble() replaces the deprecated tbl_df())
ranges <-
  ranges %>%
  as_tibble() %>%
  mutate(
    chromStart = chromStart %>% as.character %>% as.integer,
    chromEnd = chromEnd %>% as.character %>% as.integer
    )
viable <-
  viable %>%
  as_tibble() %>%
  mutate(
    chromStart = chromStart %>% as.character %>% as.integer,
    chromEnd = chromEnd %>% as.character %>% as.integer
    )

My answer starts here:

1. Reformat the data frames to the GenomicRanges class.
2. Find the regions by doing an intersection.
3. Add the bin, name, and score columns back using findOverlaps. (Note: this information is removed during the intersection because there is not necessarily a 1:1 mapping.)
4. Reformat the output back into a data frame.
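To make the intersection step concrete, here is a minimal base-R sketch of what intersecting two closed integer intervals means. The helper name intersect_interval is hypothetical and not part of GenomicRanges; it only illustrates the arithmetic, using the first row of ranges and the first row of viable.

```r
# Hypothetical helper: intersect two closed intervals [start1, end1] and
# [start2, end2]. Returns NULL when the intervals do not overlap.
intersect_interval <- function(start1, end1, start2, end2) {
  start <- max(start1, start2)  # the later of the two starts
  end <- min(end1, end2)        # the earlier of the two ends
  if (start > end) {
    return(NULL)                # no shared positions
  }
  c(start = start, end = end)
}

# First row of ranges against the first row of viable
intersect_interval(836780, 856723, 840000, 890000)
```

The overlap begins at the larger start (840000) and stops at the smaller end (856723), which is why the first row of the desired output is clipped to 840000.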

gr.ranges <-
  makeGRangesFromDataFrame(ranges,
                           keep.extra.columns = TRUE,
                           seqnames.field = "chrom",
                           start.field = "chromStart",
                           end.field = "chromEnd")
gr.viable <-
  makeGRangesFromDataFrame(viable,
                           keep.extra.columns = TRUE,
                           seqnames.field = "chrom",
                           start.field = "chromStart",
                           end.field = "chromEnd")

# To find the intersects
gr.intersect <-
  GenomicRanges::intersect(gr.ranges, gr.viable)

# For linking up the non- chrom,start,end columns
gr.hits <-
  GenomicRanges::findOverlaps(gr.intersect, gr.ranges)

output <-
  gr.intersect[queryHits(gr.hits)]
mcols(output) <-
  mcols(gr.ranges[subjectHits(gr.hits)])
output

# Reformat to dataframe
output %>%
  as.data.frame() %>%
  select(`1` = X1, bin, chrom = seqnames, chromStart = start, chromEnd = end, name, score)
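As a rough way to sanity-check the coordinates without Bioconductor, the same idea can be sketched in base R. This is not the method used above, only an illustration under some assumptions: everything is on chr1 (so chrom is ignored), the viable windows are sorted by start, and the names merge_intervals and clipped are made up for this example. If I read the intersect() semantics correctly, the GenomicRanges code should yield the same three pieces.

```r
# Example coordinates copied from the question (chr1 only).
ranges_start <- c(836780, 1693001, 1750780, 2020123)
ranges_end   <- c(856723, 1739032, 1880930, 2108890)
viable_start <- c(840000, seq(1690000, 1760000, by = 10000))
viable_end   <- viable_start + 50000

# Hypothetical helper: merge overlapping windows into disjoint blocks.
# Assumes the windows are already sorted by start.
merge_intervals <- function(start, end) {
  out_start <- start[1]
  out_end <- end[1]
  for (i in seq_along(start)[-1]) {
    k <- length(out_start)
    if (start[i] <= out_end[k]) {
      # Overlaps the current block: extend it if needed
      out_end[k] <- max(out_end[k], end[i])
    } else {
      # Gap before this window: start a new block
      out_start <- c(out_start, start[i])
      out_end <- c(out_end, end[i])
    }
  }
  data.frame(start = out_start, end = out_end)
}

merged <- merge_intervals(viable_start, viable_end)

# Clip every range against every merged block and keep the non-empty pieces.
clipped <- do.call(rbind, lapply(seq_along(ranges_start), function(i) {
  s <- pmax(ranges_start[i], merged$start)
  e <- pmin(ranges_end[i], merged$end)
  keep <- s <= e
  data.frame(range_index = rep(i, sum(keep)), start = s[keep], end = e[keep])
}))
clipped
```

With the example data, the first three ranges each keep one piece and the fourth (2020123 to 2108890) is dropped because it overlaps no viable window. The range_index column plays the role that findOverlaps plays above: it says which row of ranges each piece came from, so bin, name, and score can be joined back.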
