How to use tf.data.Dataset with kedro?
I am using tf.data.Dataset to prepare a streaming dataset which is used to train a tf.keras model. With kedro, is there a way to create a node and return the created tf.data.Dataset to use it in the next training node?
The MemoryDataset will probably not work because tf.data.Dataset cannot be pickled (deepcopy isn't possible), see also this SO question. According to issue #91, the deep copy in MemoryDataset is done to prevent some other node from modifying the data. Can someone please elaborate a bit more on why/how this concurrent modification could happen?
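As a quick illustration of the failure mode (a minimal sketch, not from the question; the exact exception type and message vary across TF versions):

# sketch.py -- demonstrating that a tf.data.Dataset cannot be deep-copied
import copy
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])
try:
    copy.deepcopy(ds)
except Exception as exc:
    # Fails because the dataset's internal graph resources are not picklable.
    print(f"deepcopy failed: {exc}")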
From the docs, there seems to be a copy_mode = "assign" option. Would it be possible to use this option in case the data is not picklable?
Another solution (also mentioned in issue 91) is to use just a function to generate the streaming tf.data.Dataset inside the training node, without having the preceding dataset generation node. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.
Also, I would like to avoid storing the complete output of the streaming dataset, for example using tfrecords or tf.data.experimental.save, as these options would use a lot of disk storage.
Is there a way to pass just the created tf.data.Dataset object to use it for the training node?
Providing a workaround here for the benefit of the community, though it was originally presented in kedro.community by @DataEngineerOne.
According to @DataEngineerOne:
With kedro, is there a way to create a node and return the created tf.data.Dataset to use it in the next training node?
Yes, absolutely!
Can someone please elaborate a bit more on why/how this concurrent modification could happen?
From the docs, there seems to be a copy_mode = "assign". Would it be possible to use this option in case the data is not picklable?
I have yet to try this option, but it should theoretically work. All you would need to do is create a new dataset entry in the catalog.yml file that includes the copy_mode option.
For example:
# catalog.yml
tf_data:
  type: MemoryDataSet
  copy_mode: assign

# pipeline.py
node(
    tf_generator,
    inputs=...,
    outputs="tf_data",
)
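For context: MemoryDataSet accepts copy_mode values of deepcopy, copy, and assign, where assign stores and returns the object reference without copying it. Below is a minimal sketch of what the tf_generator node above might look like; the function body is an assumption for illustration, not from the original answer.

# node.py (hypothetical sketch)
import tensorflow as tf

def tf_generator(tensor_slices):
    # With copy_mode: assign, Kedro hands this object to the next node
    # as-is instead of attempting a deepcopy.
    return tf.data.Dataset.from_tensor_slices(tensor_slices)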
I cannot vouch for this solution, but give it a go and let me know if it works for you.
Another solution (also mentioned in issue 91) is to use just a function to generate the streaming tf.data.Dataset inside the training node, without having the preceding dataset generation node. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.
This is also a great alternative solution, and I think (guess) that the MemoryDataSet will automatically use assign in this case, rather than its normal deepcopy, so you should be alright. (In fact, even a plain deepcopy should be harmless for the wrapper itself, since Python's copy module treats function objects as atomic and returns them unchanged.)
# node.py
import tensorflow as tf

def generate_tf_data(...):
    tensor_slices = [1, 2, 3]

    # Return a zero-argument function that builds the dataset lazily;
    # Kedro passes the function object along, not the dataset itself.
    def _tf_data():
        dataset = tf.data.Dataset.from_tensor_slices(tensor_slices)
        return dataset

    return _tf_data

def use_tf_data(tf_data_func):
    dataset = tf_data_func()  # the dataset is created inside the training node
# pipeline.py
Pipeline([
    node(
        generate_tf_data,
        inputs=...,
        outputs='tf_data_func',
    ),
    node(
        use_tf_data,
        inputs='tf_data_func',
        outputs=...,
    ),
])
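To make the handoff concrete, here is a self-contained sketch of what a training node like use_tf_data might do with the dataset function. The toy model, data shapes, and training call are assumptions for illustration, not part of the original answer.

# hypothetical_training_sketch.py
import tensorflow as tf

def make_tf_data():
    # Stands in for the function returned by generate_tf_data; it yields
    # (features, labels) batches so model.fit can consume it directly.
    xs = tf.constant([[0.0], [1.0], [2.0], [3.0]])
    ys = tf.constant([[0.0], [2.0], [4.0], [6.0]])
    return tf.data.Dataset.from_tensor_slices((xs, ys)).batch(2)

def use_tf_data(tf_data_func):
    dataset = tf_data_func()  # build the streaming dataset inside the node
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer="adam", loss="mse")
    model.fit(dataset, epochs=1)
    return model

use_tf_data(make_tf_data)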
The only drawback here is the additional complexity. For more details, you can refer here.