Tensorflow: running predictions on the GPU and CPU simultaneously
I'm working with tensorflow and I want to speed up the prediction phase of a pre-trained Keras model (I'm not interested in the training phase) by using the CPU and one GPU simultaneously.
I tried to create 2 different threads that feed two different tensorflow sessions (one running on the CPU and the other on the GPU). Each thread feeds a fixed number of batches in a loop (e.g. if we have 100 batches in total, I want to assign 20 batches to the CPU and 80 to the GPU, or any possible combination of the two) and I then combine the results. It would be better if the split were done automatically, e.g. with both workers pulling from a shared queue, as in the sketch below.
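As an aside, this is roughly how I imagine the automatic split could work: both workers pull from a shared queue until it is empty, so the faster device naturally takes more batches. The run_batch helper below is hypothetical, standing in for a session.run call on the corresponding device's tensor.

from queue import Queue, Empty
from threading import Thread

def device_worker(batch_queue, predict_tensor, results):
    # Pull the next batch as soon as this device is free; the faster
    # device automatically ends up processing more batches.
    while True:
        try:
            idx, batch = batch_queue.get_nowait()
        except Empty:
            return
        # run_batch is a hypothetical helper wrapping session.run for this tensor
        results[idx] = run_batch(predict_tensor, batch)

def predict_auto_split(batches, tensor_cpu, tensor_gpu):
    batch_queue = Queue()
    for idx, batch in enumerate(batches):
        batch_queue.put((idx, batch))
    results = [None] * len(batches)
    workers = [Thread(target=device_worker, args=(batch_queue, t, results))
               for t in (tensor_cpu, tensor_gpu)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results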
However, even with the fixed split, the batches seem to be processed in a synchronous way: even when I send only a few batches to the CPU and compute all the others on the GPU (so that the GPU is the bottleneck), the overall prediction time is always higher than in the test run using the GPU alone.
I would expect it to be faster because, when only the GPU is working, CPU usage is around 20-30%, so there is spare CPU capacity available to speed up the computation.
I have read a lot of discussions, but they all deal with parallelism across multiple GPUs, not between a GPU and the CPU.
Here is a sample of the code I have written: the tensor_cpu and tensor_gpu objects are loaded from the same Keras model in this way:
import tensorflow as tf
from keras.models import load_model

# x is the model's input placeholder, defined elsewhere with the appropriate shape
with tf.device('/gpu:0'):
    model_gpu = load_model('model1.h5')
    tensor_gpu = model_gpu(x)

with tf.device('/cpu:0'):
    model_cpu = load_model('model1.h5')
    tensor_cpu = model_cpu(x)
Then the prediction is done as follows:
from threading import Thread

def predict_on_device(session, predict_tensor, batches):
    for batch in batches:
        session.run(predict_tensor, feed_dict={x: batch})

def split_cpu_gpu(batches, num_batches_cpu, tensor_cpu, tensor_gpu):
    session1 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session1.run(tf.global_variables_initializer())

    session2 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session2.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    t_cpu = Thread(target=predict_on_device, args=(session1, tensor_cpu, batches[:num_batches_cpu]))
    t_gpu = Thread(target=predict_on_device, args=(session2, tensor_gpu, batches[num_batches_cpu:]))

    t_cpu.start()
    t_gpu.start()

    coord.join([t_cpu, t_gpu])

    session1.close()
    session2.close()
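For reference, this is roughly how I call it; the batch shapes below are made up, the real ones depend on the model's input layer:

import numpy as np

# Hypothetical input: 100 batches of 32 samples with 10 features each
# (placeholder shapes only; the real ones depend on the model).
batches = [np.random.rand(32, 10).astype(np.float32) for _ in range(100)]

# e.g. 20 batches on the CPU and the remaining 80 on the GPU
split_cpu_gpu(batches, 20, tensor_cpu, tensor_gpu)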
How can I achieve this CPU/GPU parallelization? I think I'm missing something.
Any help would be appreciated!
Here's my code that demonstrates how CPU and GPU execution can be done in parallel:
import tensorflow as tf
import numpy as np
from time import time
from threading import Thread

n = 1024 * 8

# The CPU gets 1/16 of the rows, so the GPU thread does ~16x as much work.
data_cpu = np.random.uniform(size=[n//16, n]).astype(np.float32)
data_gpu = np.random.uniform(size=[n, n]).astype(np.float32)

with tf.device('/cpu:0'):
    x = tf.placeholder(name='x', dtype=tf.float32)

def get_var(name):
    return tf.get_variable(name, shape=[n, n])

def op(name):
    # chain of 8 matmuls against an n x n weight matrix
    w = get_var(name)
    y = x
    for _ in range(8):
        y = tf.matmul(y, w)
    return y

with tf.device('/cpu:0'):
    cpu = op('w_cpu')

with tf.device('/gpu:0'):
    gpu = op('w_gpu')

def f(session, y, data):
    return session.run(y, feed_dict={x: data})

with tf.Session(config=tf.ConfigProto(log_device_placement=True, intra_op_parallelism_threads=8)) as sess:
    sess.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    threads = []
    # comment out 0 or 1 of the following 2 lines:
    threads += [Thread(target=f, args=(sess, cpu, data_cpu))]
    threads += [Thread(target=f, args=(sess, gpu, data_gpu))]

    t0 = time()

    for t in threads:
        t.start()

    coord.join(threads)

    t1 = time()

    print(t1 - t0)
The timing results were:

- CPU thread: 4-5s (this will vary by machine, of course).
- GPU thread: 5s (it does 16x as much work).
- Both threads at the same time: 5s
Note that there was no need to have 2 sessions (but that worked for me too).
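Applied to the code in your question, the single-session version would look roughly like this (a sketch only, reusing predict_on_device, tensor_cpu/tensor_gpu, batches and num_batches_cpu from above):

# Sketch: one session serving both device-pinned copies of the model.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                      log_device_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())

    t_cpu = Thread(target=predict_on_device,
                   args=(sess, tensor_cpu, batches[:num_batches_cpu]))
    t_gpu = Thread(target=predict_on_device,
                   args=(sess, tensor_gpu, batches[num_batches_cpu:]))

    for t in (t_cpu, t_gpu):
        t.start()
    for t in (t_cpu, t_gpu):
        t.join()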
The reasons you might be seeing different results could be:
- some contention for system resources (GPU execution does consume some host system resources, and if the CPU thread crowds it, that could worsen the performance); see the sketch after this list
- incorrect timing
- part of your model can only run on GPU/CPU
- a bottleneck somewhere else
- some other problem
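For the first point, one thing you could try is capping TensorFlow's CPU thread pools so that the CPU copy of the model leaves a core or two free for feeding the GPU; a sketch, with thread counts that are only a starting point to tune for your machine:

# Leave some host cores free for the GPU's input feeding and kernel launches.
# The exact numbers are machine-dependent; treat them as a starting point.
config = tf.ConfigProto(
    intra_op_parallelism_threads=6,   # threads used inside a single CPU op
    inter_op_parallelism_threads=2,   # threads used to run independent ops
    log_device_placement=True)        # also confirms where each op actually runs
sess = tf.Session(config=config)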