Can I use MPI with shared memory?
I have written simulation software for highly parallelized execution, using MPI for internode and threads for intranode parallelization, relying on shared memory where possible to reduce the memory footprint. (The largest data structures are mostly read-only, so thread-safety is easy to manage.)
Although my program works fine (finally), I am having second thoughts about whether this approach is really best, mostly because managing two kinds of parallelization does require some messy asynchronous code here and there.
I found a paper (pdf draft) introducing a shared memory extension to MPI, allowing the use of shared data structures within MPI parallelization on a single node.
I am not very experienced with MPI, so my question is: is this possible with recent standard Open MPI implementations, and where can I find an introduction/tutorial on how to do it?
Note that I am not talking about how message passing is accomplished with shared memory; I know that MPI does that. I would like to (read-)access the same object in memory from multiple MPI processes.
This can be done with MPI-3 shared-memory windows. Here is a test code that sets up a small table on each shared-memory node: only one process per node (node rank 0) actually allocates and initialises the table, but all processes on that node can read it.
#include <stdio.h>
#include <stdlib.h>

#include <mpi.h>

int main(void)
{
    int i, flag;
    int nodesize, noderank;
    int size, rank;
    int tablesize, localtablesize;
    int *table, *localtable;
    int *model;

    MPI_Comm allcomm, nodecomm;
    MPI_Win wintable;

    char verstring[MPI_MAX_LIBRARY_VERSION_STRING];
    char nodename[MPI_MAX_PROCESSOR_NAME];

    MPI_Aint winsize;
    int windisp;

    int version, subversion, verstringlen, nodestringlen;

    allcomm = MPI_COMM_WORLD;
    tablesize = 5;

    MPI_Init(NULL, NULL);

    MPI_Comm_size(allcomm, &size);
    MPI_Comm_rank(allcomm, &rank);

    MPI_Get_processor_name(nodename, &nodestringlen);
    MPI_Get_version(&version, &subversion);
    MPI_Get_library_version(verstring, &verstringlen);

    if (rank == 0)
    {
        printf("Version %d, subversion %d\n", version, subversion);
        printf("Library <%s>\n", verstring);
    }

    // Create node-local communicator: ranks that can share memory end up together
    MPI_Comm_split_type(allcomm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &nodecomm);

    MPI_Comm_size(nodecomm, &nodesize);
    MPI_Comm_rank(nodecomm, &noderank);

    // Only rank 0 on a node actually allocates memory
    localtablesize = 0;
    if (noderank == 0) localtablesize = tablesize;

    // debug info
    printf("Rank %d of %d, rank %d of %d in node <%s>, localtablesize %d\n",
           rank, size, noderank, nodesize, nodename, localtablesize);

    // Allocate the shared window; all ranks other than node rank 0 contribute zero bytes
    MPI_Win_allocate_shared(localtablesize*sizeof(int), sizeof(int),
                            MPI_INFO_NULL, nodecomm, &localtable, &wintable);

    // Check that the window uses the unified memory model
    MPI_Win_get_attr(wintable, MPI_WIN_MODEL, &model, &flag);

    if (1 != flag)
    {
        printf("Attribute MPI_WIN_MODEL not defined\n");
    }
    else
    {
        if (MPI_WIN_UNIFIED == *model)
        {
            if (rank == 0) printf("Memory model is MPI_WIN_UNIFIED\n");
        }
        else
        {
            if (rank == 0) printf("Memory model is *not* MPI_WIN_UNIFIED\n");

            MPI_Finalize();
            return 1;
        }
    }

    // Need to get a local pointer valid for the table allocated on node rank 0
    table = localtable;

    if (noderank != 0)
    {
        MPI_Win_shared_query(wintable, 0, &winsize, &windisp, &table);
    }

    // All table pointers should now point to the copy on node rank 0

    // Initialise table on node rank 0 with appropriate synchronisation
    MPI_Win_fence(0, wintable);

    if (noderank == 0)
    {
        for (i = 0; i < tablesize; i++)
        {
            table[i] = rank*tablesize + i;
        }
    }

    MPI_Win_fence(0, wintable);

    // Check we did it right
    for (i = 0; i < tablesize; i++)
    {
        printf("rank %d, noderank %d, table[%d] = %d\n",
               rank, noderank, i, table[i]);
    }

    MPI_Finalize();

    return 0;
}
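To try it (not part of the original answer): compile with the MPI wrapper compiler and launch across the nodes, for example mpicc shared_table.c -o shared_table followed by mpirun -n 6 ./shared_table. The file name here is illustrative, and the exact launcher (mpirun, mpiexec, or a batch-system wrapper) and any hostfile arguments depend on your MPI installation.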
Here is some sample output for 6 processes across two nodes. Each node's table is initialised by that node's rank 0 process with the values rank*tablesize + i, so the node where world rank 0 lives sees 0 ... 4, while the node where world rank 3 is node rank 0 sees 15 ... 19:
Version 3, subversion 1
Library <SGI MPT 2.14 04/05/16 03:53:22>
Rank 3 of 6, rank 0 of 3 in node <r1i0n1>, localtablesize 5
Rank 4 of 6, rank 1 of 3 in node <r1i0n1>, localtablesize 0
Rank 5 of 6, rank 2 of 3 in node <r1i0n1>, localtablesize 0
Rank 0 of 6, rank 0 of 3 in node <r1i0n0>, localtablesize 5
Rank 1 of 6, rank 1 of 3 in node <r1i0n0>, localtablesize 0
Rank 2 of 6, rank 2 of 3 in node <r1i0n0>, localtablesize 0
Memory model is MPI_WIN_UNIFIED
rank 3, noderank 0, table[0] = 15
rank 3, noderank 0, table[1] = 16
rank 3, noderank 0, table[2] = 17
rank 3, noderank 0, table[3] = 18
rank 3, noderank 0, table[4] = 19
rank 4, noderank 1, table[0] = 15
rank 4, noderank 1, table[1] = 16
rank 4, noderank 1, table[2] = 17
rank 4, noderank 1, table[3] = 18
rank 4, noderank 1, table[4] = 19
rank 5, noderank 2, table[0] = 15
rank 5, noderank 2, table[1] = 16
rank 5, noderank 2, table[2] = 17
rank 5, noderank 2, table[3] = 18
rank 5, noderank 2, table[4] = 19
rank 0, noderank 0, table[0] = 0
rank 0, noderank 0, table[1] = 1
rank 0, noderank 0, table[2] = 2
rank 0, noderank 0, table[3] = 3
rank 0, noderank 0, table[4] = 4
rank 1, noderank 1, table[0] = 0
rank 1, noderank 1, table[1] = 1
rank 1, noderank 1, table[2] = 2
rank 1, noderank 1, table[3] = 3
rank 1, noderank 1, table[4] = 4
rank 2, noderank 2, table[0] = 0
rank 2, noderank 2, table[1] = 1
rank 2, noderank 2, table[2] = 2
rank 2, noderank 2, table[3] = 3
rank 2, noderank 2, table[4] = 4
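As an aside that is not part of the original answer: since the table here is write-once and read-many, the two MPI_Win_fence calls could also be replaced by passive-target synchronisation, holding a shared lock on the window for its whole lifetime. A minimal sketch, reusing the variables from the program above and assuming the unified memory model that the program already checks for:

MPI_Win_lock_all(MPI_MODE_NOCHECK, wintable);

// Only node rank 0 writes; plain stores go directly into the shared segment
if (noderank == 0)
{
    for (i = 0; i < tablesize; i++)
    {
        table[i] = rank*tablesize + i;
    }
}

MPI_Win_sync(wintable);   // memory barrier: flush this process's stores
MPI_Barrier(nodecomm);    // ensure all writes complete before any reads
MPI_Win_sync(wintable);   // memory barrier: observe the other processes' stores

// ... every rank on the node may now read table[] directly ...

MPI_Win_unlock_all(wintable);

The fence version is shorter; the passive-target version avoids a collective synchronisation for every access epoch, which can matter when the table is read repeatedly inside long compute phases.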