How to train a large model with a large batch size on a single GPU using TensorFlow?

Problem description:

I have a very big model which cannot be trained on a single GPU with a batch size of 64 due to out-of-memory errors. Someone suggested that I use a smaller batch size; however, if I decrease the batch size, the accuracy drops. One solution is to feed only half of the current batch, store the gradients, and then feed the remaining half before updating. This can be done explicitly with compute_gradients and apply_gradients, but it is relatively inconvenient (a concise implementation would be fine). So I wonder whether there is a nicer solution (or a concise implementation) to this problem.
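For reference, here is a minimal sketch of the gradient-accumulation idea described above, using TensorFlow 1.x graph-mode APIs: compute_gradients is run on each sub-batch, the gradients are added into accumulator variables, and apply_gradients is called once per full batch. The placeholders, the toy dense layer, the `n_sub_batches` value, and the `batches` iterator are illustrative stand-ins, not part of the original question.

```python
import tensorflow as tf

# Toy stand-in for the real network (x, y, and the dense layer are illustrative only).
x = tf.placeholder(tf.float32, shape=[None, 784])
y = tf.placeholder(tf.int64, shape=[None])
logits = tf.layers.dense(x, 10)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))

optimizer = tf.train.AdamOptimizer(1e-3)
tvars = tf.trainable_variables()

# One non-trainable accumulator per trainable variable, plus an op to reset them.
accum_vars = [tf.Variable(tf.zeros_like(v.initialized_value()), trainable=False)
              for v in tvars]
zero_ops = [av.assign(tf.zeros_like(av)) for av in accum_vars]

# Add the gradients of the current sub-batch into the accumulators.
grads_and_vars = optimizer.compute_gradients(loss, tvars)
accum_ops = [accum_vars[i].assign_add(g) for i, (g, _) in enumerate(grads_and_vars)]

# Apply the averaged accumulated gradients once per "virtual" batch.
n_sub_batches = 2  # e.g. two sub-batches of 32 to emulate a batch of 64
apply_op = optimizer.apply_gradients(
    [(accum_vars[i] / n_sub_batches, v) for i, (_, v) in enumerate(grads_and_vars)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch_x, batch_y in batches:       # hypothetical iterator over full batches of 64
        sess.run(zero_ops)
        for i in range(n_sub_batches):     # feed the batch in halves
            sub_x = batch_x[i::n_sub_batches]
            sub_y = batch_y[i::n_sub_batches]
            sess.run(accum_ops, feed_dict={x: sub_x, y: sub_y})
        sess.run(apply_op)
```

Note that this reproduces the gradients of the full batch exactly (up to the averaging), so it trades extra forward/backward passes for lower peak memory while keeping the effective batch size at 64.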

Thanks in advance.

You may consider looking into this: https://github.com/openai/gradient-checkpointing.

There has been a lot of research lately on making backprop more memory efficient at the expense of additional forward passes. This is a very recent implementation of one such scheme for TensorFlow.
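For what it's worth, a typical way to use that library (as best I recall from its README; the exact import path and the `checkpoints='memory'` option are assumptions that should be verified against the current repository) is to call its drop-in replacement for tf.gradients and feed the result to apply_gradients:

```python
import tensorflow as tf
from memory_saving_gradients import gradients  # module from the gradient-checkpointing repo

# `loss` and the model are assumed to be built already, as in the question.
tvars = tf.trainable_variables()

# Same call pattern as tf.gradients, but intermediate activations are recomputed
# from automatically selected checkpoints during backprop instead of being kept
# in memory for the whole forward pass.
grads = gradients(loss, tvars, checkpoints='memory')

train_op = tf.train.AdamOptimizer(1e-3).apply_gradients(zip(grads, tvars))
```

This attacks the memory problem from a different angle than gradient accumulation: it keeps the batch size at 64 in a single pass and instead reduces the activation memory held during backprop, at the cost of extra recomputation.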