cudaMemcpy an array of pointers, where each pointer points to an array

Problem description:

I am attempting to create an array of pointers on the host. Each pointer in the array points to an array of size 4. When I try to copy a pointer to the device, the copy fails and the device cannot access the contents of the array that the pointer points to. How would I copy a pointer, taken from an array of pointers, together with the array it points to, from host to device?

__global__ void kernel(int* D)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < 4)
    {
        printf("Device = %d\n", D[tid]);
        tid += blockDim.x * gridDim.x;
    }
}

int main(void)
{
    cudaProfilerStart();

    int* H[2];
    int* D[2]; 
    int test1[4] = { 1, 2, 3, 4 };
    int test2[4] = { 10, 20, 30, 40 };

    H[0] = test1;
    H[1] = test2;

    HANDLE_ERROR(cudaMalloc((void**)&D[0], 4 * sizeof(int)));
    HANDLE_ERROR(cudaMemcpy(D[0], H[0], 4 * sizeof(int), cudaMemcpyHostToDevice));
    kernel <<<1, 4 >>>(D[0]);

    cudaProfilerStop();

    return 0;
}

As talonmies pointed out, there is nothing wrong with the code. However, you will not see the output of the printf calls in your kernel, because the kernel launch is asynchronous and your process ends before the kernel's prints can be executed. A synchronization call after the launch solves the problem here. In real code this might not be needed, for example when a later blocking call such as a device-to-host cudaMemcpy already waits for the kernel to finish.

#include <stdlib.h>
#include <stdio.h>

__global__ void kernel(int* D)
{
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while (tid < 4)
        {
                printf("Device = %d\n", D[tid]);
                tid += blockDim.x * gridDim.x;
        }
}

int main(void)
{
        // cudaProfilerStart();

        int* H[2];
        int* D[2];
        int test1[4] = { 1, 2, 3, 4 };
        int test2[4] = { 10, 20, 30, 40 };

        H[0] = test1;
        H[1] = test2;

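        // Allocate device memory for the first array and copy its contents over.
        cudaMalloc((void**)&D[0], 4 * sizeof(int));
        cudaMemcpy(D[0], H[0], 4 * sizeof(int), cudaMemcpyHostToDevice);
        kernel <<<1, 1 >>>(D[0]);

        // Kernel launches are asynchronous: wait for the kernel to finish so
        // that its printf output is flushed before the process exits.
        cudaError_t cudaerr1 = cudaDeviceSynchronize();
        if (cudaerr1 != cudaSuccess)
                printf("kernel launch failed with error \"%s\".\n",
                        cudaGetErrorString(cudaerr1));

        // cudaProfilerStop();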

        return 0;
}
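
The code above only sends D[0] to the device. If the kernel needs to reach both arrays through the array of pointers itself, one common approach is sketched below: copy each array to the device, then copy the host array of device pointers into a device-side pointer table. This is only a sketch under assumptions. The names CHECK, ROWS, COLS, kernelAll and dTable are hypothetical and do not appear in the original post, and CHECK merely stands in for the HANDLE_ERROR macro that the question uses without defining.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): abort on any CUDA error.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

constexpr int ROWS = 2;  // number of arrays
constexpr int COLS = 4;  // elements per array

// Each thread prints one element, reached through the device-side pointer table.
__global__ void kernelAll(int* const* table)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < ROWS * COLS) {
        int r = tid / COLS;
        int c = tid % COLS;
        printf("Device[%d][%d] = %d\n", r, c, table[r][c]);
    }
}

int main(void)
{
    int test1[COLS] = { 1, 2, 3, 4 };
    int test2[COLS] = { 10, 20, 30, 40 };
    int* H[ROWS] = { test1, test2 };  // host pointers to host data
    int* D[ROWS];                     // host array that will hold device pointers

    // Step 1: give every array its own device buffer and copy the data.
    for (int r = 0; r < ROWS; ++r) {
        CHECK(cudaMalloc((void**)&D[r], COLS * sizeof(int)));
        CHECK(cudaMemcpy(D[r], H[r], COLS * sizeof(int), cudaMemcpyHostToDevice));
    }

    // Step 2: copy the array of device pointers itself to the device.
    int** dTable = nullptr;
    CHECK(cudaMalloc((void**)&dTable, ROWS * sizeof(int*)));
    CHECK(cudaMemcpy(dTable, D, ROWS * sizeof(int*), cudaMemcpyHostToDevice));

    kernelAll<<<1, ROWS * COLS>>>(dTable);
    CHECK(cudaGetLastError());        // reports launch-configuration errors
    CHECK(cudaDeviceSynchronize());   // waits for the kernel and flushes printf

    for (int r = 0; r < ROWS; ++r)
        CHECK(cudaFree(D[r]));
    CHECK(cudaFree(dTable));
    return 0;
}
```

Note that H itself is never copied: its entries are host addresses, which the device cannot dereference. Only D, whose entries come from cudaMalloc, is meaningful as a pointer table on the device.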