Efficiently grouping nodes?

Problem description:

While most questions are about grouping nodes based on similarity (pigeonholes), I would like to group nodes based simply on their proximity.

I have a large, dense collection of nodes, potentially millions. On-screen they take up some amount of space, so they can be thought of as having a size.

What I am trying to do is group these nodes into single containing nodes efficiently, both in terms of processing time and in terms of collecting as many nodes as possible per container.

My current attempts have either been too slow or didn't work, but they are all based on the same idea: calculate a lot of possible containers by taking a node and its surrounding nodes at random and grouping them, then pick the most effective container.

What are your ideas? They need not be specific to any language, but I will be using PHP or JavaScript for this.

Edit

I forgot to mention that the nodes will be streamed in, so the solution needs to accept an unbounded number of nodes, putting them into containers as they arrive and creating new containers or even deleting them as necessary, for up to millions of containers. That would be ideal.


This problem is called clustering. You have a set of nodes and a function m that calculates the distance between any two nodes. You then search for clusters such that the sum of the distances between all pairs of nodes inside each cluster is minimal.
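To make that objective concrete, here is a minimal JavaScript sketch that scores a candidate grouping. It assumes 2D nodes with `x` and `y` fields and Euclidean distance for m; the names `distance` and `clusteringCost` are made up for illustration. Note that the score is only meaningful when comparing groupings with the same number of clusters, since putting every node in its own cluster trivially gives a cost of zero.

```javascript
// Euclidean distance between two 2D nodes; stands in for the distance function m.
function distance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Sum of the distances between all pairs of nodes inside each cluster,
// added up over all clusters. Lower is better for a fixed number of clusters.
function clusteringCost(clusters, m = distance) {
  let total = 0;
  for (const cluster of clusters) {
    for (let i = 0; i < cluster.length; i++) {
      for (let j = i + 1; j < cluster.length; j++) {
        total += m(cluster[i], cluster[j]);
      }
    }
  }
  return total;
}

// Four nodes forming two obvious pairs.
const nodes = [
  { x: 0, y: 0 },
  { x: 0, y: 1 },
  { x: 10, y: 10 },
  { x: 10, y: 11 },
];

// Grouping neighbours together versus grouping far-apart nodes together.
const good = [[nodes[0], nodes[1]], [nodes[2], nodes[3]]];
const bad = [[nodes[0], nodes[2]], [nodes[1], nodes[3]]];

const goodCost = clusteringCost(good); // 1 + 1 = 2
const badCost = clusteringCost(bad);   // much larger than goodCost
```

The pairwise loop is quadratic in the cluster size, so this is useful for checking small examples rather than for scoring millions of nodes.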

There are some simple algorithms for this. Search for k-means and k-medoids, for example; both are very similar to your approach. A more efficient version is the CLARANS algorithm [NH94]. I couldn't find any really good sources for you, but these should help:

(German) Lecture notes on clustering in general, with CLARANS in pseudocode on page 45: http://www.informatik.hu-berlin.de/forschung/gebiete/wbi/teaching/archive/ws1112/vl_datawarehousing/15_clustering_12.pdf

English lecture notes that explain CLARANS: http://bib.dbvis.de/uploadedFiles/232.pdf

Paper about CLARANS: http://www.comp.nus.edu.sg/~atung/publication/pakdd002.pdf

The "k" in the names is the number of clusters. For all three algorithms you have to specify the number of clusters a priori.
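As a rough illustration of how k enters the picture, here is a minimal k-means sketch (Lloyd's iteration) in JavaScript. It is a sketch under assumptions, not a tuned implementation: nodes are 2D objects with `x` and `y`, the function name `kMeans` is made up, and the centroids are seeded with the first k nodes to keep the run deterministic, whereas real code would normally seed them at random.

```javascript
// Minimal k-means: alternate between assigning each node to its nearest
// centroid and moving each centroid to the mean of its assigned nodes.
function kMeans(points, k, maxIterations = 100) {
  // Assumption: seed with the first k points (deterministic, for illustration only).
  let centroids = points.slice(0, k).map((p) => ({ x: p.x, y: p.y }));
  const assignment = new Array(points.length).fill(-1);

  for (let iter = 0; iter < maxIterations; iter++) {
    let changed = false;

    // Assignment step: nearest centroid by squared Euclidean distance.
    points.forEach((p, i) => {
      let best = 0;
      let bestDist = Infinity;
      centroids.forEach((c, j) => {
        const d = (p.x - c.x) ** 2 + (p.y - c.y) ** 2;
        if (d < bestDist) {
          bestDist = d;
          best = j;
        }
      });
      if (assignment[i] !== best) {
        assignment[i] = best;
        changed = true;
      }
    });

    // Converged: no node switched cluster in this pass.
    if (!changed) break;

    // Update step: each centroid becomes the mean of its nodes.
    const sums = centroids.map(() => ({ x: 0, y: 0, n: 0 }));
    points.forEach((p, i) => {
      const s = sums[assignment[i]];
      s.x += p.x;
      s.y += p.y;
      s.n += 1;
    });
    centroids = sums.map((s, j) =>
      s.n > 0 ? { x: s.x / s.n, y: s.y / s.n } : centroids[j]
    );
  }

  return { centroids, assignment };
}

// Two well-separated groups of three nodes each, k = 2.
const samplePoints = [
  { x: 0, y: 0 },
  { x: 10, y: 10 },
  { x: 0, y: 1 },
  { x: 1, y: 0 },
  { x: 10, y: 11 },
  { x: 11, y: 10 },
];
const kMeansResult = kMeans(samplePoints, 2);
// kMeansResult.assignment is [0, 1, 0, 0, 1, 1]
```

Each pass costs on the order of (number of nodes) x k distance computations, which is worth keeping in mind for millions of nodes and a large k.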

For a different approach, see the DBSCAN algorithm. You won't need the number of clusters for this algorithm, but you have to provide some other knowledge about your nodes instead: a neighborhood radius and the minimum number of nodes that counts as a dense region. The Wikipedia article explains this very well. :-)
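For comparison, a minimal DBSCAN sketch in JavaScript follows. The assumptions are mine: 2D nodes with `x` and `y`, Euclidean distance, a made-up function name `dbscan`, and a brute-force neighbour search that is quadratic in the number of nodes. For millions of nodes you would want to replace `neighborsOf` with a lookup in a spatial index such as a grid or a tree. The parameters `eps` (neighbourhood radius) and `minPts` (minimum number of nodes in a neighbourhood, counting the node itself) are the two inputs the algorithm needs in place of a cluster count.

```javascript
// Minimal DBSCAN: returns one label per node, a cluster id >= 0 or -1 for noise.
function dbscan(points, eps, minPts) {
  const NOISE = -1;
  const labels = new Array(points.length).fill(undefined);

  // Brute-force range query: indices of all nodes within eps of node i (including i).
  const neighborsOf = (i) => {
    const result = [];
    for (let j = 0; j < points.length; j++) {
      const d = Math.hypot(points[i].x - points[j].x, points[i].y - points[j].y);
      if (d <= eps) result.push(j);
    }
    return result;
  };

  let clusterId = 0;
  for (let i = 0; i < points.length; i++) {
    if (labels[i] !== undefined) continue; // already processed

    const neighbors = neighborsOf(i);
    if (neighbors.length < minPts) {
      labels[i] = NOISE; // may later be claimed by a cluster as a border node
      continue;
    }

    // Node i is a core node: start a new cluster and grow it outwards.
    labels[i] = clusterId;
    const pending = neighbors.filter((j) => j !== i);
    while (pending.length > 0) {
      const j = pending.pop();
      if (labels[j] === NOISE) labels[j] = clusterId; // border node joins the cluster
      if (labels[j] !== undefined) continue;          // already in some cluster
      labels[j] = clusterId;
      const reachable = neighborsOf(j);
      if (reachable.length >= minPts) {
        // j is itself a core node, so its neighbours also belong to this cluster.
        for (const n of reachable) pending.push(n);
      }
    }
    clusterId++;
  }
  return labels;
}

// Two dense groups and one isolated node.
const dbscanPoints = [
  { x: 0, y: 0 },
  { x: 0, y: 1 },
  { x: 1, y: 0 },
  { x: 10, y: 10 },
  { x: 10, y: 11 },
  { x: 11, y: 10 },
  { x: 50, y: 50 },
];
const dbscanLabels = dbscan(dbscanPoints, 1.5, 3);
// dbscanLabels is [0, 0, 0, 1, 1, 1, -1]
```

The number of clusters falls out of the data here: the two dense groups become clusters 0 and 1, and the isolated node is reported as noise rather than being forced into a container.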