Split a TensorFlow dataset created with make_csv_dataset into three parts (X1_Train, X2_Train and Y_Train) for a multi-input model

Problem description:

I am training a deep learning model with TensorFlow 2 and Keras. I read my large CSV file with tf.data.experimental.make_csv_dataset and then split it into train and test datasets. However, my model takes two sets of inputs in different layers, so I need to split the train dataset into three parts and pass [x1_train, x2_train], y_train to model.fit.

My question is: how can I split train_dataset into x1_train, x2_train and y_train? (Some features should go into x1_train and others into x2_train.)

My code:

def get_dataset(file_path, **kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=64, 
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True, 
      **kwargs)
  return dataset

full_dataset = get_dataset(dataset_path)
full_dataset = full_dataset.shuffle(buffer_size=400000)
train_dataset = full_dataset.take(360000)
test_dataset = full_dataset.skip(360000)
test_dataset = test_dataset.take(40000)
x1_train =train_dataset[:,0:2820]
x2_train =train_dataset[:,2820:2822]
y_train=train_dataset[:,2822]
x1_test =test_dataset[:,0:2820]
x2_test =test_dataset[:,2820:2822]
y_test=test_dataset[:,2822]
model.fit([x1_train,x2_train],y_train,validation_data=([x1_test,x2_test],y_test), callbacks=callbacks_list, verbose=1,epochs=EPC)

Error message:

x1_train =train_dataset[:,0:2820]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'TakeDataset' object is not subscriptable

As mentioned in the comments, you can use the map method of the Dataset object returned by make_csv_dataset to split and combine the samples according to your model's expected input format.

For example, suppose we have a CSV file containing the following data:

a,b,c,d,e
1,2,3,4,111
5,6,7,8,222
9,10,11,12,333
13,14,15,16,444

Now, suppose we want to read this CSV file with the make_csv_dataset function. However, our model has two input layers named input1 and input2 (set using the name argument of the Input layer), where input1 is fed the feature values in columns a and b, and input2 is fed the feature values in columns c and d. Further, column e is our target (i.e. label) column.

So let's first read this data and see what it looks like:

import tensorflow as tf
from pprint import pprint

dataset = tf.data.experimental.make_csv_dataset(
      'data.csv',
      batch_size=2,
      label_name='e',
      num_epochs=1,
)

for x in dataset:
    pprint(x)

"""
The printed result:

(OrderedDict([('a',
               <tf.Tensor: shape=(2,), dtype=int32, numpy=array([5, 1], dtype=int32)>),
              ('b',
               <tf.Tensor: shape=(2,), dtype=int32, numpy=array([6, 2], dtype=int32)>),
              ('c',
               <tf.Tensor: shape=(2,), dtype=int32, numpy=array([7, 3], dtype=int32)>),
              ('d',
               <tf.Tensor: shape=(2,), dtype=int32, numpy=array([8, 4], dtype=int32)>)]),
 <tf.Tensor: shape=(2,), dtype=int32, numpy=array([222, 111], dtype=int32)>)
(OrderedDict([('a',
               <tf.Tensor: shape=(2,), dtype=int32, numpy=array([13,  9], dtype=int32)>),
              ('b',
               <tf.Tensor: shape=(2,), dtype=int32, numpy=array([14, 10], dtype=int32)>),
              ('c',
               <tf.Tensor: shape=(2,), dtype=int32, numpy=array([15, 11], dtype=int32)>),
              ('d',
               <tf.Tensor: shape=(2,), dtype=int32, numpy=array([16, 12], dtype=int32)>)]),
 <tf.Tensor: shape=(2,), dtype=int32, numpy=array([444, 333], dtype=int32)>)
"""

As you can see, the first element of each batch is a dictionary mapping column names to the respective feature values, and the second element holds the labels. Now, let's use the map method to split and combine these feature values into the proper format for our model:

first_input_cols = ['a', 'b']
second_input_cols = ['c', 'd']

def split_and_combine_batch_samples(samples, targets):
    inp1 = []
    for k in first_input_cols:
        inp1.append(samples[k])
    inp2 = []
    for k in second_input_cols:
        inp2.append(samples[k])
    
    inp1 = tf.stack(inp1, axis=-1)
    inp2 = tf.stack(inp2, axis=-1)
    return {'input1': inp1, 'input2': inp2}, targets

dataset = dataset.map(split_and_combine_batch_samples)

for x in dataset:
    pprint(x)

"""
The printed values:

({'input1': <tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[ 9, 10],
       [13, 14]], dtype=int32)>,
  'input2': <tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[11, 12],
       [15, 16]], dtype=int32)>},
 <tf.Tensor: shape=(2,), dtype=int32, numpy=array([333, 444], dtype=int32)>)
({'input1': <tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[5, 6],
       [1, 2]], dtype=int32)>,
  'input2': <tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[7, 8],
       [3, 4]], dtype=int32)>},
 <tf.Tensor: shape=(2,), dtype=int32, numpy=array([222, 111], dtype=int32)>)

"""
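For a wide CSV like the one in the question, typing every column name by hand is impractical. One possible variation, sketched below, builds the two groups by slicing a list of feature column names by position. The list feature_cols and the function name split_by_position are hypothetical, and a single hand-built batch stands in for the output of make_csv_dataset so that the snippet runs without a CSV file.

```python
import tensorflow as tf

# Hypothetical ordered list of feature columns (label column excluded).
# In practice this could be taken from the CSV header.
feature_cols = ['a', 'b', 'c', 'd']

# Slice by position, analogous to 0:2820 and 2820:2822 in the question.
first_input_cols = feature_cols[:2]
second_input_cols = feature_cols[2:4]

def split_by_position(samples, targets):
    # Stack the per-column tensors of shape (batch,) into (batch, n_cols).
    inp1 = tf.stack([samples[k] for k in first_input_cols], axis=-1)
    inp2 = tf.stack([samples[k] for k in second_input_cols], axis=-1)
    return {'input1': inp1, 'input2': inp2}, targets

# Stand-in for one batch as produced by make_csv_dataset (assumption).
samples = {
    'a': tf.constant([1, 5]), 'b': tf.constant([2, 6]),
    'c': tf.constant([3, 7]), 'd': tf.constant([4, 8]),
}
targets = tf.constant([111, 222])
batched = tf.data.Dataset.from_tensors((samples, targets))

inputs, labels = next(iter(batched.map(split_by_position)))
print(inputs['input1'].numpy())
print(inputs['input2'].numpy())
```

With the real data, the same function would be passed to the map method of the dataset returned by make_csv_dataset.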

That's it! You can now process this mapped dataset further (e.g. with take, shuffle, etc.) and, when ready, pass it to the fit method of your model. Don't forget to give names to the input layers of your model, though; they must match the dictionary keys used above (input1 and input2).
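To illustrate the last step, here is a minimal sketch of a two-input Keras model whose Input layers are named input1 and input2, trained directly on a dataset of (feature dictionary, labels) batches. The layer sizes, the loss, and the names train_ds and val_ds are arbitrary assumptions, and a small in-memory dataset stands in for the mapped CSV dataset. Note that take and skip count elements of the dataset, and since make_csv_dataset yields batches, they count batches rather than rows.

```python
import tensorflow as tf

# Hypothetical two-input model; the Input names match the dictionary keys.
input1 = tf.keras.Input(shape=(2,), name='input1')
input2 = tf.keras.Input(shape=(2,), name='input2')
h1 = tf.keras.layers.Dense(4, activation='relu')(input1)
h2 = tf.keras.layers.Dense(4, activation='relu')(input2)
merged = tf.keras.layers.Concatenate()([h1, h2])
output = tf.keras.layers.Dense(1)(merged)
model = tf.keras.Model(inputs=[input1, input2], outputs=output)
model.compile(optimizer='adam', loss='mse')

# Stand-in for the mapped dataset: batches of ({'input1', 'input2'}, labels).
features = {
    'input1': tf.constant([[1., 2.], [5., 6.], [9., 10.], [13., 14.]]),
    'input2': tf.constant([[3., 4.], [7., 8.], [11., 12.], [15., 16.]]),
}
targets = tf.constant([111., 222., 333., 444.])
dataset = tf.data.Dataset.from_tensor_slices((features, targets)).batch(2)

# take/skip operate on batches here: one batch for training, one for validation.
train_ds = dataset.take(1)
val_ds = dataset.skip(1)

# No separate x/y arguments are needed; each element already carries both.
history = model.fit(train_ds, validation_data=val_ds, epochs=1, verbose=0)
print(sorted(history.history.keys()))
```

Because each dataset element already contains both the inputs and the labels, there is no need to produce separate x1_train, x2_train and y_train objects as attempted in the question.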