How to use tf.data.Dataset with kedro?
I am using tf.data.Dataset to prepare a streaming dataset which is used to train a tf.keras model. With kedro, is there a way to create a node and return the created tf.data.Dataset to use it in the next training node?
The MemoryDataset will probably not work because tf.data.Dataset cannot be pickled (deepcopy isn't possible), see also this SO question. According to issue #91, the deep copy in MemoryDataset is done to prevent some other node from modifying the data. Can someone please elaborate a bit more on why/how this concurrent modification could happen?
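As a quick illustration of the failure mode (a minimal sketch, not from the question; the exact exception type and message vary across TF versions):

# sketch.py -- demonstrating that a tf.data.Dataset cannot be deep-copied
import copy
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])
try:
    copy.deepcopy(ds)
except Exception as exc:
    # Fails because the dataset's internal graph resources are not picklable.
    print(f"deepcopy failed: {exc}")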
From the docs, there seems to be a copy_mode = "assign" option. Would it be possible to use this option in case the data is not picklable?
Another solution (also mentioned in issue 91) is to use just a function to generate the streaming tf.data.Dataset inside the training node, without having the preceding dataset generation node. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.
Also, I would like to avoid storing the complete output of the streaming dataset, for example using tfrecords or tf.data.experimental.save, as these options would use a lot of disk storage.
Is there a way to pass just the created tf.data.Dataset object to use it for the training node?
Providing a workaround here for the benefit of the community, though it was originally presented in kedro.community by @DataEngineerOne.
According to @DataEngineerOne:
With kedro, is there a way to create a node and return the created tf.data.Dataset to use it in the next training node?
Yes, absolutely!
Can someone please elaborate a bit more on why/how this concurrent modification could happen?
From the docs, there seems to be a copy_mode = "assign". Would it be possible to use this option in case the data is not picklable?
I have yet to try this option, but it should theoretically work. All you would need to do is create a new dataset entry in the catalog.yml file that includes the copy_mode option.
For example:
# catalog.yml
tf_data:
  type: MemoryDataSet
  copy_mode: assign

# pipeline.py
node(
    tf_generator,
    inputs=...,
    outputs="tf_data",
)
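For context: MemoryDataSet accepts copy_mode values of deepcopy, copy, and assign, where assign stores and returns the object reference without copying it. Below is a minimal sketch of what the tf_generator node above might look like; the function body is an assumption for illustration, not from the original answer.

# node.py (hypothetical sketch)
import tensorflow as tf

def tf_generator(tensor_slices):
    # With copy_mode: assign, Kedro hands this object to the next node
    # as-is instead of attempting a deepcopy.
    return tf.data.Dataset.from_tensor_slices(tensor_slices)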
I cannot vouch for this solution, but give it a go and let me know if it works for you.
Another solution (also mentioned in issue 91) is to use just a function to generate the streaming tf.data.Dataset inside the training node, without having the preceding dataset generation node. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.
This is also a great alternative solution, and I think (guess) that the MemoryDataSet will automatically use assign in this case, rather than its normal deepcopy, so you should be alright. (In fact, even a plain deepcopy should be harmless for the wrapper itself, since Python's copy module treats function objects as atomic and returns them unchanged.)
# node.py
import tensorflow as tf

def generate_tf_data(...):
    tensor_slices = [1, 2, 3]

    # Return a zero-argument function that builds the dataset lazily;
    # Kedro passes the function object along, not the dataset itself.
    def _tf_data():
        dataset = tf.data.Dataset.from_tensor_slices(tensor_slices)
        return dataset

    return _tf_data

def use_tf_data(tf_data_func):
    dataset = tf_data_func()  # the dataset is created inside the training node
# pipeline.py
Pipeline([
    node(
        generate_tf_data,
        inputs=...,
        outputs='tf_data_func',
    ),
    node(
        use_tf_data,
        inputs='tf_data_func',
        outputs=...,
    ),
])
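To make the handoff concrete, here is a self-contained sketch of what a training node like use_tf_data might do with the dataset function. The toy model, data shapes, and training call are assumptions for illustration, not part of the original answer.

# hypothetical_training_sketch.py
import tensorflow as tf

def make_tf_data():
    # Stands in for the function returned by generate_tf_data; it yields
    # (features, labels) batches so model.fit can consume it directly.
    xs = tf.constant([[0.0], [1.0], [2.0], [3.0]])
    ys = tf.constant([[0.0], [2.0], [4.0], [6.0]])
    return tf.data.Dataset.from_tensor_slices((xs, ys)).batch(2)

def use_tf_data(tf_data_func):
    dataset = tf_data_func()  # build the streaming dataset inside the node
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer="adam", loss="mse")
    model.fit(dataset, epochs=1)
    return model

use_tf_data(make_tf_data)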
The only drawback here is the additional complexity. For more details, you can refer here.