Is there an efficient way to group nearby locations by latitude and longitude?

Problem description:


I'm trying to figure out a way to cluster multiple addresses based on proximity. I have latitude and longitude, which in this case is ideal, as some of the clusters would cross City/Zip boundaries. What I would have as a starting point is similar to this, but up to 10,000 rows within the table:

library(tibble)

Hospital.Addresses <- tibble(
  Hospital_Name = c("Massachusetts General Hospital", "MGH - Blake Building",
                    "Shriners Hospitals for Children — Boston",
                    "Yale-New Haven Medical Center", "Memorial Sloan Kettering",
                    "MSKCC Urgent Care Center",
                    "Memorial Sloan Kettering Blood Donation Room"),
  Address = c("55 Fruit St", "100 Blossom St", "51 Blossom St", "York St",
              "1275 York Ave", "425 E 67th St",
              "1250 1st Avenue Between 67th and 68th Streets"),
  City = c("Boston", "Boston", "Boston", "New Haven",
           "New York", "New York", "New York"),
  State = c("MA", "MA", "MA", "CT", "NY", "NY", "NY"),
  Zip = c("02114", "02114", "02114", "06504", "10065", "10065", "10065"),
  Latitude = c(42.363230, 42.364030, 42.363090, 41.304507,
               40.764390, 40.764248, 40.764793),
  Longitude = c(-71.068680, -71.069430, -71.066630, -72.936781,
                -73.956810, -73.957127, -73.957818))


I would like to cluster the groups of addresses that are within ~1 mile of each other, potentially without calculating the Haversine distance between 10,000 individual points. We could potentially make the math easy and roughly estimate 1 mile as 0.016 degrees of either latitude or longitude.
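That 0.016-degree shortcut can be sketched as simple grid bucketing: round each coordinate to the nearest 0.016-degree cell and group on the cell id. This is only a rough sketch (the coordinates below are illustrative points taken from the sample data), and it is approximate in two ways: a degree of longitude shrinks with latitude, and two points less than a mile apart can still fall in adjacent cells.

```r
# Rough grid bucketing: points sharing a cell id are within ~1 mile,
# but near-misses across cell boundaries are possible.
lat  <- c(42.36323, 42.36403, 41.30451)    # two Boston points, one New Haven
long <- c(-71.06868, -71.06943, -72.93678)

cell <- 0.016  # ~1 mile in degrees (rough; longitude degrees shrink with latitude)
grid_id <- paste(round(lat / cell), round(long / cell), sep = ":")
grid_id  # the two Boston points share a cell id; the New Haven point does not
```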


An ideal output would be something that validates the 3 hospital locations in Boston are in Group 1 (all within 1 mile of each other), the hospital in New Haven is on its own in Group 2 (not within 1 mile of anything else), and the 3 hospital locations in NY are all in Group 3 (all within 1 mile of each other).


Instead of group_by(), I'm more looking for group_near().
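One way such a hypothetical `group_near()` verb could be sketched (this is not an existing dplyr function): compute pairwise Haversine distances with `geosphere::distm`, then cut a single-linkage dendrogram at the mile threshold so that chains of nearby points share a group id.

```r
library(geosphere)

# Hypothetical helper, not an existing verb: rows chained together
# within `miles` of each other receive the same group id.
group_near <- function(long, lat, miles = 1) {
  d <- distm(cbind(long, lat)) / 1609.34      # Haversine metres -> miles
  cutree(hclust(as.dist(d), method = "single"), h = miles)
}
```

Single linkage is the natural choice here because "within ~1 mile of each other" chains transitively; complete linkage would instead require every pair in a group to be within the threshold.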

Any suggestions would be greatly appreciated.


Actually, the distm function from the geosphere package can handle 10,000 points in just a couple of minutes, which on my machine is not terribly bad compared to the time it took to write this solution. The distance matrix for 10,000 random points consumed less than a gig of memory.
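A quick sanity check on that memory figure: a dense 10,000 × 10,000 matrix of doubles is 10,000² × 8 bytes, about 0.75 GiB, which matches the observation.

```r
n <- 10000
bytes <- n^2 * 8   # one 8-byte double per matrix entry
bytes / 1024^3     # ~0.745 GiB, i.e. "less than a gig"
```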

Cluster with hclust and cut the tree with cutree:

# create fake data
lat  <- runif(10000, min = 28, max = 42)
long <- runif(10000, min = -109, max = -71)
df <- data.frame(long, lat)

library(geosphere)

start <- Sys.time()
# create a distance matrix in miles (distm returns metres; /1000 -> km, *0.62 -> miles)
dmat <- distm(df) / 1000 * 0.62
print(Sys.time() - start)

# cluster
clusted <- hclust(as.dist(dmat))
# plot(clusted)
# find the cluster ids for 2-mile distances
clustersIDs <- cutree(clusted, h = 2)
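Applied to the sample hospital data from the question (with abbreviated names and a plain data.frame so the snippet stands alone), the same distm/hclust/cutree pipeline cut at 1 mile should recover the three expected groups:

```r
library(geosphere)

hosp <- data.frame(
  name = c("MGH", "MGH Blake", "Shriners Boston", "Yale-New Haven",
           "MSK", "MSK Urgent Care", "MSK Blood Donation"),
  lat  = c(42.363230, 42.364030, 42.363090, 41.304507,
           40.764390, 40.764248, 40.764793),
  long = c(-71.068680, -71.069430, -71.066630, -72.936781,
           -73.956810, -73.957127, -73.957818))

# Haversine distances in miles (distm returns metres; 1609.34 m per mile)
dmat <- distm(cbind(hosp$long, hosp$lat)) / 1609.34

# cut the dendrogram at 1 mile
hosp$group <- cutree(hclust(as.dist(dmat)), h = 1)
hosp$group  # 1 1 1 2 3 3 3: Boston, New Haven, New York
```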