OpenMP offloading target region executes on both the host and the target device

Problem description:

I'm working on a project that requires OpenMP offloading to Nvidia GPUs using Clang. I was able to install Clang with offloading support by following the instructions mentioned here.

System specification

  • OS: Ubuntu 16.04 LTS
  • Clang: version 4.00
  • Processor: Intel(R) Core(TM) i7-4700MQ CPU
  • CUDA: version 9.0
  • Nvidia GPU: GeForce 740M (sm_capability 35)

The problem is that when I execute a sample program to test OpenMP offloading to the Nvidia GPU, part of the target region appears to run on the GPU, and then the same target region starts executing on the host.

Here is the sample program, a small C program written to multiply two matrices.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

/* Problem size. */
# define N 1920

void init_array(float* A, float* B)
{
    int i, j;
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            A[i*N + j] = ((float) i*j) / N;
        }
    }

    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            B[i*N + j] = ((float) i*(j+1)) / N;
        }
    }
}
void mm_kernel(float *A, float *B, float *E)
{
    #pragma omp target data map(to:A) map(to:B) map(alloc:E)
    {
        #pragma omp target
        #pragma omp teams distribute num_teams(4)
        for (int i = 0; i < N; i++)
        {
            printf("Team %d Thread %d Number of threads %d \n", omp_get_team_num(), omp_get_thread_num(), omp_get_num_threads());
            #pragma omp parallel for
            for (int j = 0; j < N; j++)
            {
                E[i*N + j] = 0.0;
                for (int k = 0; k < N; k++)
                {
                    E[i*N + j] = E[i*N + j] + A[i*N + k] * B[j*N+k];
                }
            }
        }
    }
}

int main(){
  double t_start, t_end;

    float* A;
    float* B;
    float* E;

    A = (float*)malloc(N*N*sizeof(float));
    B = (float*)malloc(N*N*sizeof(float));
    E = (float*)malloc(N*N*sizeof(float));
    init_array(A, B); //initialize Matrix A and B

    t_start = omp_get_wtime();
    mm_kernel(A,B,E);
    t_end = omp_get_wtime();

    printf("Time spent %lf\n",t_end-t_start );
    free(A);
    free(B);
    free(E);
}

The program was compiled using

clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda 3mm.c -o 3mmgpu 

The main reason I claim that the target region is executing on both the host and the target device is the output on the command line.

At first, team 0 and team 1 each report 960 threads; later iterations report 2 threads per team. (My processor has 4 cores, each capable of running 2 hardware threads.)

I also tried running the fat binary under nvprof to verify whether anything is actually being executed on the GPU.

The profiling results are as follows.

I cannot understand what is happening in the target region. Why is the target region being executed on both the host and the target device?

I'm posting an answer to my own question, as I was finally able to figure out what went wrong in the code. The offloaded region was crashing on the target device because I had mapped the data to the GPU incorrectly: I mapped only the pointers, without allocating memory on the GPU for the arrays they point to. When the GPU execution crashed, execution fell back to the host.

Thank you @Alexey Bataev for pointing that out.