Algorithm for distributed Pearson cross-correlation matrix computation partitioned by time and key

Problem description:

What could be an algorithm for computing a Pearson cross-correlation matrix in a distributed environment where my data is divided by id (say, 1-4) and time (say, Jan-Dec) among different nodes?

For example:

Node A({id1, Jan}, {id2, Jan}); Node B({id3, Jan}, {id4, Jan}),
Node C({id1, Feb}, {id2, Feb}); Node A({id1, March}, {id2, March}),
Node C({id3, Feb}, {id4, Feb}); Node B({id3, March}, {id4, March})

Basically, I mean that the Jan data for all ids is not on one node.

I'm wondering what strategy I could use so that I do not have to ship large amounts of data from one node to another, since Pearson correlation is a pairwise computation. I'm OK with transferring only small intermediate results between nodes. How should I partition my data by id and time so that I can efficiently calculate the cross-correlation matrix among multiple ids?

The language of choice is C++.


The correlation between two vectors of data is cor(X,Y) = cov(X,Y) / [sd(X) * sd(Y)]. Is there any way to break this up into block computations? The essential computation required (since sd(X) = sqrt(cov(X,X))) is

cov(X,Y) = <X Y> - <X> <Y>
         = 1/N (sum[i] X[i] Y[i]) - 1/N (sum[i] X[i]) * 1/N (sum[i] Y[i])

This is a sum over all indices i. Each index i, however, corresponds to a node n with N_n events and a sub-index k_n within that node:

cov(X,Y) = 1/N (sum[n] sum[k_n] X[k_n] Y[k_n])
         - 1/N^2 (sum[n] sum[k_n] X[k_n]) * (sum[n] sum[k_n] Y[k_n])

Since N = sum[n] N_n, this can be rewritten as

cov(X,Y) = (sum[n] N_n/N 1/N_n sum[k_n] X[k_n] Y[k_n])
         - (sum[n] N_n/N 1/N_n sum[k_n] X[k_n]) * (sum[n] N_n/N 1/N_n sum[k_n] Y[k_n])
         = (sum[n] N_n/N <XY>_n) - (sum[n] N_n/N <X>_n) * (sum[n] N_n/N <Y>_n)

So, each node need only report its number of entries N_n and the means <X>_n, <Y>_n, and <XY>_n (and, for the purposes of the correlation, <X^2>_n and <Y^2>_n) computed within the node. The global covariance can then be calculated by summing these per-node means with the appropriate weights N_n/N (where again N = sum[n] N_n) to obtain the global means.
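
As a minimal C++ sketch of that scheme (the names NodeStats, localStats, and correlation are illustrative, and it assumes each node holds time-aligned slices of both series X and Y): each node computes its summary locally, and only those few numbers are shipped and combined.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-node summary for one (X, Y) pair: the entry count N_n and the five
// local means <X>_n, <Y>_n, <XY>_n, <X^2>_n, <Y^2>_n from the derivation
// above. Only this small struct needs to cross the network.
struct NodeStats {
    std::size_t n = 0;         // N_n
    double mx = 0, my = 0;     // <X>_n, <Y>_n
    double mxy = 0;            // <XY>_n
    double mx2 = 0, my2 = 0;   // <X^2>_n, <Y^2>_n
};

// Computed locally on each node from its time-aligned slices of X and Y.
NodeStats localStats(const std::vector<double>& x, const std::vector<double>& y) {
    NodeStats s;
    s.n = x.size();            // assumes x.size() == y.size() and n > 0
    for (std::size_t i = 0; i < s.n; ++i) {
        s.mx  += x[i];         s.my  += y[i];
        s.mxy += x[i] * y[i];
        s.mx2 += x[i] * x[i];  s.my2 += y[i] * y[i];
    }
    s.mx /= s.n; s.my /= s.n; s.mxy /= s.n; s.mx2 /= s.n; s.my2 /= s.n;
    return s;
}

// Weighted combination of the per-node means: <Z> = sum[n] (N_n/N) <Z>_n.
double correlation(const std::vector<NodeStats>& nodes) {
    std::size_t n = 0;
    for (const auto& s : nodes) n += s.n;               // N = sum[n] N_n
    double mx = 0, my = 0, mxy = 0, mx2 = 0, my2 = 0;
    for (const auto& s : nodes) {
        const double w = static_cast<double>(s.n) / n;  // N_n / N
        mx  += w * s.mx;   my  += w * s.my;  mxy += w * s.mxy;
        mx2 += w * s.mx2;  my2 += w * s.my2;
    }
    const double covXY = mxy - mx * my;           // cov(X,Y) = <XY> - <X><Y>
    const double sdX = std::sqrt(mx2 - mx * mx);  // sd(X) = sqrt(cov(X,X))
    const double sdY = std::sqrt(my2 - my * my);
    return covXY / (sdX * sdY);
}
```

Each node contributes a constant-size summary regardless of N_n, so the communication cost is independent of the data volume.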

Edit: LaTeX version

Since these equations are hard to parse without LaTeX, here are more readable versions. The covariance of two lists of data X and Y is defined to be

$$\operatorname{cov}(X,Y) \;=\; \langle XY\rangle \;-\; \langle X\rangle\langle Y\rangle$$

where each quantity <X>, <Y>, and <XY> is a mean (of the list X, the list Y, and the pairwise-product list XY, respectively). The computation of the means can be broken down as a weighted sum over the various nodes. Calling any of X, Y, XY, X^2, or Y^2 (the last two being necessary to compute the correlation) Z, the mean of Z is

$$\langle Z\rangle \;=\; \sum_{k} \frac{N_k}{N}\,\langle Z\rangle_k, \qquad N \;=\; \sum_{k} N_k$$

where <Z>_k is the mean of Z on the k-th node and N_k is the number of data points on the k-th node. This reduces the amount of information needed from each node to N_k, <X>_k, <Y>_k, <XY>_k, <X^2>_k, and <Y^2>_k.
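
To extend this to the full matrix among multiple ids, one hedged sketch (building on NodeStats and correlation from the earlier block; emitPartial and crossCorrelationMatrix are hypothetical names) keys the partial summaries by id pair and reduces each pair independently. Note that forming <XY>_n locally requires both ids' time-aligned values on the same node, so for pairs whose data never co-resides (e.g. id1 and id3 in the layout above) one id's series for the shared months must first be sent to the node holding the other, since the pointwise products cannot be formed otherwise.

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>
// Reuses NodeStats and correlation() from the sketch above.

// Partial summaries keyed by id pair: each node emits one NodeStats per
// (id_i, id_j) pair whose time-aligned slices it holds locally.
using PairKey  = std::pair<int, int>;                   // normalized: first < second
using Partials = std::map<PairKey, std::vector<NodeStats>>;

// Called on behalf of each node for every id pair it can cover.
void emitPartial(Partials& acc, int i, int j, const NodeStats& s) {
    acc[{std::min(i, j), std::max(i, j)}].push_back(s);
}

// Once every node has reported, reduce each pair's partials to a single
// correlation value; the result is the upper triangle of the matrix.
std::map<PairKey, double> crossCorrelationMatrix(const Partials& acc) {
    std::map<PairKey, double> corr;
    for (const auto& [pair, parts] : acc)
        corr[pair] = correlation(parts);
    return corr;
}
```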