TimeDistributed(Dense) vs Dense in Keras - Same number of parameters
I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.
My simplified model is the following:
import keras

InputSize = 15
MaxLen = 64
HiddenSize = 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
# return_sequences=True: the GRU emits its hidden state at every timestep, shape (None, MaxLen, HiddenSize)
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
# The same Dense(InputSize) is applied independently at each of the MaxLen timesteps
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)
The summary of this network is:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================
This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
However, if I switch to a simple Dense layer:
inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
# Plain Dense applied directly to the 3-D GRU output
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)
I still only have 255 parameters:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================
I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
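One way to check this is to give the two layers identical weights and compare their outputs on the same 3-D input. The following is only a small sketch with arbitrary layer sizes, assuming a Keras 2-style functional API:

import numpy as np
import keras

# Two tiny models: a plain Dense and a TimeDistributed(Dense), both fed a 3-D tensor
inp = keras.layers.Input(shape=(4, 3))
dense_model = keras.models.Model(inp, keras.layers.Dense(2)(inp))
td_model = keras.models.Model(inp, keras.layers.TimeDistributed(keras.layers.Dense(2))(inp))

# Copy the plain Dense kernel/bias into the wrapped Dense so both use the same weights
td_model.layers[-1].set_weights(dense_model.layers[-1].get_weights())

x = np.random.random((5, 4, 3))
print(np.allclose(dense_model.predict(x), td_model.predict(x)))  # expect True

If the two predictions agree, the plain Dense really is doing the same per-timestep mapping as the TimeDistributed wrapper on this kind of input.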
Update: Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py it does seem that Dense only uses the last dimension to size itself:
def build(self, input_shape):
    assert len(input_shape) >= 2
    input_dim = input_shape[-1]
    self.kernel = self.add_weight(shape=(input_dim, self.units),
It also uses K.dot (the Keras backend dot) to apply the weights:
def call(self, inputs):
    output = K.dot(inputs, self.kernel)
The docs of K.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be called at every time step. If so, the question still remains what TimeDistributed() achieves in this case.
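As far as I can tell, that behaviour can be reproduced in plain NumPy: np.dot contracts the last axis of the input with the first axis of the kernel, which is exactly the per-timestep matrix product a Dense layer would do. A rough sketch, with sizes matching the model above:

import numpy as np

batch, timesteps, in_dim, out_dim = 2, 64, 16, 15
x = np.random.random((batch, timesteps, in_dim))
kernel = np.random.random((in_dim, out_dim))
bias = np.random.random((out_dim,))

# Dot over the whole 3-D tensor: contracts the last axis of x with the kernel
y_all = np.dot(x, kernel) + bias   # shape (2, 64, 15)

# Same computation applied one timestep at a time
y_per_step = np.stack([x[:, t] @ kernel + bias for t in range(timesteps)], axis=1)

print(np.allclose(y_all, y_per_step))  # True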
TimeDistributedDense applies the same Dense layer to every time step during GRU/LSTM cell unrolling, so the error function is computed between the predicted label sequence and the actual label sequence (which is normally what sequence-to-sequence labeling problems require).
However, with return_sequences=False, the Dense layer is applied only once, at the last cell. This is normally the case when RNNs are used for a classification problem. If return_sequences=True, then the Dense layer is applied at every timestep, just like TimeDistributedDense.
So in your models both are the same, but if you change your second model to return_sequences=False, the Dense will be applied only at the last cell. Try changing it and the model will throw an error, because Y would then need to have size [Batch_size, InputSize]; it is no longer a sequence-to-sequence problem but a full sequence-to-label problem.
from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed
from keras.layers.recurrent import GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000

# model1: sequence to sequence, TimeDistributed(Dense) applied at every timestep
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model2: sequence to sequence, plain Dense applied to the 3-D GRU output
model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model3: sequence to label, Dense applied once to the last GRU state
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X = np.random.random([n_samples, MaxLen, InputSize])
Y1 = np.random.random([n_samples, MaxLen, OutputSize])  # sequence targets
Y2 = np.random.random([n_samples, OutputSize])          # single-label targets

model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

print(model1.summary())
print(model2.summary())
print(model3.summary())
In the above example, the architectures of model1 and model2 are the same (sequence-to-sequence models) and model3 is a full sequence-to-label model.
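To see the mismatch mentioned above, you can try feeding the sequence targets Y1 to model3. As I understand it, this should fail with a shape error; a small sketch continuing from the code above (the exact exception message depends on the Keras version):

# model3 outputs (None, OutputSize), while Y1 has shape (n_samples, MaxLen, OutputSize),
# so fitting should fail with a target-shape error
try:
    model3.fit(X, Y1, batch_size=128, epochs=1)
except ValueError as e:
    print(e)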