How do I use a GPU with OpenMP?

Problem description:

I am trying to get some code to run on the GPU using OpenMP, but I am not succeeding. In my code I perform a matrix multiplication using for loops: once with OpenMP pragmas and once without, so that I can compare the execution times. After the first loop I call omp_get_num_devices() (this is my main test to see whether I am actually connecting to a GPU). No matter what I try, omp_get_num_devices() always returns 0.

The computer I am using has two NVIDIA Tesla K40m GPUs. CUDA 7.0 and CUDA 7.5 are available on the machine as modules, with the CUDA 7.5 module typically active. gcc 4.9.3, 5.1.0, and 7.1.0 are all available as modules, with the gcc 7.1.0 module typically active. I am compiling my code with $ g++ -fopenmp -omptargets=nvptx64sm_35-nvidia-linux ParallelExperimenting.cpp -o ParallelExperimenting. I have successfully parallelized OpenMP code on the CPU, but not on the GPUs.
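One thing I am unsure about: -omptargets= may come from the Clang/LLVM offloading fork rather than mainline g++; GCC's documented flag appears to be -foffload=, so for a g++ actually built with nvptx offload support the command would presumably look more like:

$ g++ -fopenmp -foffload=nvptx-none ParallelExperimenting.cpp -o ParallelExperimenting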

My main goal here is to get omp_get_num_devices() to return 2 as proof that I can detect and use the GPUs with OpenMP. Any help would be greatly appreciated.
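For reference, a minimal probe of just the device side might look like the sketch below; if offloading actually works, omp_is_initial_device() should return false (0) inside the target region:

#include <omp.h>
#include <iostream>

int main() {
    // Number of non-host devices visible to the OpenMP runtime.
    // 0 means the runtime has no offload support compiled in, or
    // cannot find a usable device.
    std::cout << "num devices: " << omp_get_num_devices() << std::endl;

    int on_host = 1;
    // With working offload support this region runs on the GPU,
    // where omp_is_initial_device() returns 0.
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }
    std::cout << "ran on host: " << on_host << std::endl;
    return 0;
}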

Here is the code I am using to check whether the GPU is being used correctly:

#include <omp.h>
#include <fstream>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#include <iomanip>
#include <iostream>
using namespace std;

double A [501][501];
double B [501][501];
double C [501][501][501];
double D [501][501];
double E [501][501];
double F [501][501][501];
double dummyvar;
int Mapped [501];

int main() {
    int i, j, k, l, N, StallerGPU, StallerCPU;

    // Size of the matrices
    N = 500;

    // Variables merely used to make the execution take longer and to
    //   exaggerate the difference in performance between the first and
    //   second calculations
    StallerGPU = 200;
    StallerCPU = 200;

    std::cout << " N = " << N << "\n";
    // generate matrix to be used in first calculation
    for (i=0; i<N; i++) {
        for (k=0; k<N; k++) {
            if (i == k) {
                A[i][k] = i+1;
            } else {
                A[i][k] = i * k / N;
            }
        }
    }
    // generate other matrix to be used for the first calculation
    for (k=0; k<N; k++) {
        for (j=0; j<N; j++) {
            B[k][j] = 2*(N-1)-k-j;
        }
    }

//    Slightly adjusted matrices for second calculation
    for (i=0; i<N; i++) {
        for (k=0; k<N; k++) {
            if (i == k) {
                D[i][k] = i+2;
            } else {
                D[i][k] = i * k / N - 1;
            }
        }
    }

    for (k=0; k<N; k++) {
        for (j=0; j<N; j++) {
            E[k][j] = 2*(N+1)-k-j;
        }
    }

    dummyvar = 0;

    //Run the multiplication in parallel using GPUs

    double diff;
    time_t time1;
    time1 = time( NULL ); // CPU time counter
    cout << endl << " GPU section begins at " << ctime(&time1) << endl;

        //    This pragma is frequently changed to try different tags
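        //    Note: "#pragma omp for" by itself only shares the loop among
        //    host threads inside an enclosing "parallel" region; it never
        //    offloads. Offloading requires a "target" construct (see the
        //    sketch after this program).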
        #pragma omp for collapse(4) private(i, j, k, l)

        for (i=0; i<N; i++) {
//            Mapped[i] = omp_is_initial_device();
            for (j=0; j<N; j++) {
                for (k=0; k<N; k++) {
                    for(l = 0; l < StallerGPU; l++ ) {
                        C[i][j][k] = A[i][k] * B[k][j] ;
                        dummyvar += A[i][k] * B[k][j] * (l + 1);
                    }
                }
//            cout << " i " << i << endl;
            }
        }


    //record the time it took to run the multiplication    
    time_t time2 = time( NULL );
    cout << " number of devices: " << omp_get_num_devices() << endl;
    cout << " dummy variable: " << dummyvar << endl;

    float cpumin = difftime(time2,time1);
    diff = difftime(time2,time1);
    cout << " stopping at delta GPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time2) << endl;
    cout << " GPU time elapsed " << diff << " s" << endl;
    cout << endl;

    dummyvar = 0;
    time_t time3 = time( NULL );
    cout << endl << " CPU section begins at " << ctime(&time3) << endl;
//    #pragma omp single
    for (i=0; i<N; i++) {
        for (j=0; j<N; j++) {
            for (k=0; k<N; k++) {
                for (int l=0; l<StallerCPU; l++) {
                    F[i][j][k] = D[i][k] * E[k][j];
                    dummyvar += D[i][k] * E[k][j] * (l - 1);
                }
            }
        }
    }
    // the sum to complete the matrix calculation is left out here, but would
    // only be used to check if the result of the calculation is correct

    time_t time4 = time( NULL );
    cpumin = difftime(time4,time3);
    diff = difftime(time4,time3);
    cout << " dummy variable: " << dummyvar << endl;
    cout << " stopping at delta CPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time4) << endl;
    cout << " CPU time elapsed " << diff << " s" << endl;
    //Compare the time it took to confirm that we actually used GPUs to parallelize.
}
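For completeness, below is a sketch (untested, since offloading never activates on my setup) of what the offloaded form of the first loop nest could look like: the loop gets a target construct with explicit map clauses, collapse stops at the three outer loops (collapsing the l loop as well would let different threads race on the same C[i][j][k]), and the dummyvar accumulation becomes a reduction:

// Hypothetical offloaded version of the first loop nest, assuming a
// toolchain with working nvptx offload support.
#pragma omp target teams distribute parallel for collapse(3) \
    map(to: A, B) map(from: C) map(tofrom: dummyvar) reduction(+: dummyvar)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            for (int l = 0; l < StallerGPU; l++) {
                C[i][j][k] = A[i][k] * B[k][j];
                dummyvar += A[i][k] * B[k][j] * (l + 1);
            }
        }
    }
}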

Here is the result of running the deviceQuery CUDA sample:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40m"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11520 MBytes (12079136768 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 130 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K40m"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11520 MBytes (12079136768 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : Yes
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K40m
Result = PASS


GCC 4.9.3 and 5.1.0 definitely do not support OpenMP offloading to GPUs. GCC 7.1.0 does support it, but it must be built with special configure options, as described here.
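In rough outline (paths and the trailing option lists below are placeholders; the page linked above has the authoritative steps), GCC is first built as an nvptx-none accel compiler and then as a host compiler with offloading enabled. A correctly built g++ advertises the offload target in its -v output:

# accel compiler (requires nvptx-tools to be installed first):
$ ../gcc-7.1.0/configure --target=nvptx-none \
      --enable-as-accelerator-for=x86_64-pc-linux-gnu ...
# host compiler:
$ ../gcc-7.1.0/configure --enable-offload-targets=nvptx-none ...

# a correctly configured compiler reports the offload target:
$ g++ -v 2>&1 | grep OFFLOAD
OFFLOAD_TARGET_NAMES=nvptx-none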