What's the difference between GradientTape, implicit_gradients, gradients_function and implicit_value_and_gradients?

Problem description:

I'm trying to switch to TensorFlow eager mode and I find the documentation of GradientTape, implicit_gradients, gradients_function and implicit_value_and_gradients confusing.

What's the difference between them? When should I use one over the other?

The intro point in the documentation does not mention the implicit* functions at all, yet almost all of the examples in the TensorFlow repository seem to use that method for computing gradients.

There are 4 ways to automatically compute gradients when eager execution is enabled (actually, they also work in graph mode):

  • tf.GradientTape context records computations so that you can call tape.gradient() to get the gradients of any tensor computed while recording, with regards to any trainable variable.
  • tfe.gradients_function() takes a function (say f()) and returns a gradient function (say fg()) that can compute the gradients of the outputs of f() with regards to the parameters of f() (or a subset of them).
  • tfe.implicit_gradients() is very similar but fg() computes the gradients of the outputs of f() with regards to all trainable variables these outputs depend on.
  • tfe.implicit_value_and_gradients() is almost identical but fg() also returns the output of the function f().

Usually, in Machine Learning, you will want to compute the gradients of the loss with regards to the model parameters (i.e., variables), and you will generally also be interested in the value of the loss itself. For this use case, the simplest and most efficient options are tf.GradientTape and tfe.implicit_value_and_gradients() (the other two options do not give you the value of the loss itself, so if you need it, it will require extra computations). I personally prefer tfe.implicit_value_and_gradients() when writing production code, and tf.GradientTape when experimenting in a Jupyter notebook.

Edit: In TF 2.0, it seems that only tf.GradientTape remains. Maybe the other functions will be added back, but I wouldn't count on it.
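
For reference, here is a rough sketch of what the same kind of computation might look like in TF 2.x, assuming only tf.Variable and tf.GradientTape (eager execution is on by default there):

import tensorflow as tf  # TF 2.x

w1 = tf.Variable(2.0)
w2 = tf.Variable(3.0)

def weighted_sum(x1, x2):
    return w1 * x1 + w2 * x2

with tf.GradientTape() as tape:
    s = weighted_sum(5., 7.)

# gradients of s with regards to both weights, in a single call
w1_grad, w2_grad = tape.gradient(s, [w1, w2])
print(s.numpy())        # 31.0
print(w1_grad.numpy())  # 5.0 = x1
print(w2_grad.numpy())  # 7.0 = x2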

Let's create a small function to highlight the differences:

import tensorflow as tf
import tensorflow.contrib.eager as tfe
tf.enable_eager_execution()

w1 = tfe.Variable(2.0)
w2 = tfe.Variable(3.0)

def weighted_sum(x1, x2):
    return w1 * x1 + w2 * x2

s = weighted_sum(5., 7.)
print(s.numpy()) # 31

Using tf.GradientTape

Within a GradientTape context, all operations are recorded; you can then compute the gradients of any tensor computed within the context with regards to any trainable variable. For example, this code computes s within the GradientTape context, and then computes the gradient of s with regards to w1. Since s = w1 * x1 + w2 * x2, the gradient of s with regards to w1 is x1:

with tf.GradientTape() as tape:
    s = weighted_sum(5., 7.)

[w1_grad] = tape.gradient(s, [w1])
print(w1_grad.numpy()) # 5.0 = gradient of s with regards to w1 = x1
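
As a side note, by default the tape's resources are released as soon as gradient() is called once; if you want to compute several gradients from the same recording, you can use a persistent tape. A small sketch reusing the setup above:

with tf.GradientTape(persistent=True) as tape:
    s = weighted_sum(5., 7.)

[w1_grad] = tape.gradient(s, [w1])
[w2_grad] = tape.gradient(s, [w2])  # a second call is only possible with persistent=True
print(w2_grad.numpy())  # 7.0 = gradient of s with regards to w2 = x2
del tape  # release the tape's resources once you are done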

Using tfe.gradients_function()

This function returns another function that can compute the gradients of a function's returned value with regards to its parameters. For example, we can use it to define a function that will compute the gradients of s with regards to x1 and x2:

grad_fn = tfe.gradients_function(weighted_sum)
x1_grad, x2_grad = grad_fn(5., 7.)
print(x1_grad.numpy()) # 2.0 = gradient of s with regards to x1 = w1

In the context of optimization, it would make more sense to compute gradients with regards to variables that we can tweak. For this, we can change the weighted_sum() function to take w1 and w2 as parameters as well, and tell tfe.gradients_function() to only consider the parameters named "w1" and "w2":

def weighted_sum_with_weights(w1, x1, w2, x2):
    return w1 * x1 + w2 * x2

grad_fn = tfe.gradients_function(weighted_sum_with_weights, params=["w1", "w2"])
[w1_grad, w2_grad] = grad_fn(w1, 5., w2, 7.)
print(w2_grad.numpy()) # 7.0 = gradient of s with regards to w2 = x2

Using tfe.implicit_gradients()

This function returns another function that can compute the gradients of a function's returned value with regards to all trainable variables it depends on. Going back to the first version of weighted_sum(), we can use it to compute the gradients of s with regards to w1 and w2 without having to explicitly pass these variables. Note that the gradient function returns a list of gradient/variable pairs:

grad_fn = tfe.implicit_gradients(weighted_sum)
[(w1_grad, w1_var), (w2_grad, w2_var)] = grad_fn(5., 7.)
print(w1_grad.numpy()) # 5.0 = gradient of s with regards to w1 = x1

assert w1_var is w1
assert w2_var is w2

This function does seem like the simplest and most useful option, since generally we are interested in computing the gradients of the loss with regards to the model parameters (i.e., variables). Note: try making w1 untrainable (w1 = tfe.Variable(2., trainable=False)) and redefine weighted_sum(), and you will see that grad_fn only returns the gradient of s with regards to w2.
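
Here is a quick sketch of that experiment; to keep the later examples unchanged, it uses separate (hypothetical) variables u1/u2 and a copy of the function instead of redefining weighted_sum():

u1 = tfe.Variable(2.0, trainable=False)  # untrainable weight
u2 = tfe.Variable(3.0)                   # trainable weight

def weighted_sum_untrainable(x1, x2):
    return u1 * x1 + u2 * x2

grad_fn = tfe.implicit_gradients(weighted_sum_untrainable)
pairs = grad_fn(5., 7.)
print(len(pairs))           # 1: only the trainable variable u2 shows up
print(pairs[0][0].numpy())  # 7.0 = gradient with regards to u2 = x2
assert pairs[0][1] is u2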

Using tfe.implicit_value_and_gradients()

This function is almost identical to implicit_gradients(), except that the function it creates also returns the result of the function being differentiated (in this case weighted_sum()):

grad_fn = tfe.implicit_value_and_gradients(weighted_sum)
s, [(w1_grad, w1_var), (w2_grad, w2_var)] = grad_fn(5., 7.)
print(s.numpy()) # 31.0 = s = w1 * x1 + w2 * x2

When you need both the output of a function and its gradients, this function can give you a nice performance boost, since you get the output of the function for free when computing the gradients using autodiff.
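
For instance, a single training step could feed the returned gradient/variable pairs straight into an optimizer; the optimizer choice and learning rate below are illustrative, not part of the original example:

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

value_and_grads_fn = tfe.implicit_value_and_gradients(weighted_sum)
s, grads_and_vars = value_and_grads_fn(5., 7.)  # value and gradients in one pass
optimizer.apply_gradients(grads_and_vars)       # updates w1 and w2 in place

print(s.numpy())   # 31.0
print(w1.numpy())  # about 1.95 = 2.0 - 0.01 * 5.0
print(w2.numpy())  # about 2.93 = 3.0 - 0.01 * 7.0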