Balancing several imbalanced classes in an image dataset

Problem description:

I have a dataset with 12 classes in the base directory. However, the number of images per class is inconsistent, and this hurts the total accuracy. Should I therefore apply data augmentation to the particular classes that have a low amount of data?

Image counts per class:

#Dummy Classes

    AAAA: 713
    ABCD: 274
    ACBD: 335
    ADBC: 576
    BBBB: 538
    BACD: 607
    BCAD: 253
    BDAD: 257
    CCCC: 463
    CABD: 309
    CBAD: 452
    CDAB: 762

So, assuming I should apply data augmentation to increase the amount of data in the lower classes: I did apply data augmentation, but it does not add any image data. Besides that, I want to generate the augmented data next to the raw data, which means the input and output directories will be the same.
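To decide how much augmentation each class needs, one option is to compute every class's shortfall relative to the largest class. A minimal sketch using the counts listed above (the target of 762 is simply the size of the largest class, CDAB):

```python
# Per-class counts copied from the question; the target is the
# size of the largest class (CDAB: 762).
counts = {
    'AAAA': 713, 'ABCD': 274, 'ACBD': 335, 'ADBC': 576,
    'BBBB': 538, 'BACD': 607, 'BCAD': 253, 'BDAD': 257,
    'CCCC': 463, 'CABD': 309, 'CBAD': 452, 'CDAB': 762,
}

target = max(counts.values())
# Number of augmented images each class needs to reach the target
deficits = {label: target - n for label, n in counts.items()}
print(deficits['ABCD'])  # 488 images short of the largest class
```

This gives a concrete augmentation budget per class instead of augmenting everything blindly.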

Augmentation code for particular (individual) classes:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=45,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='reflect', cval=125)

i = 0

for batch in datagen.flow_from_directory(directory='/content/dataset/ABCD',
                                         batch_size=317,
                                         target_size=(256, 256),
                                         color_mode='rgb',
                                         save_to_dir='/content/dataset/ABCD',
                                         save_prefix='aug',
                                         save_format='png'):
    i += 1
    if i > 100:
        break

Output: Found 0 images belonging to 0 classes.
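The 0-images result happens because `flow_from_directory` expects the given `directory` to contain one sub-folder per class; pointing it straight at a class folder like `/content/dataset/ABCD` leaves it with no class sub-folders to scan. A minimal sketch of the layout it expects (the temporary paths here are made up for illustration), with the corrected call shown in comments:

```python
import os
import tempfile

# Build the layout flow_from_directory expects: a root directory that
# contains one sub-folder per class (here only the class 'ABCD').
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'ABCD'))
open(os.path.join(root, 'ABCD', 'img1.png'), 'wb').close()

# flow_from_directory discovers classes from the sub-folders of
# `directory`. To augment a single class, point it at the parent and
# restrict it with `classes` (not executed here):
# datagen.flow_from_directory(directory='/content/dataset',
#                             classes=['ABCD'], ...)

class_dirs = sorted(
    d for d in os.listdir(root)
    if os.path.isdir(os.path.join(root, d)))
print(class_dirs)  # ['ABCD']
```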

As I mentioned, I am using flow_from_dataframe, so you might start by creating a csv file for your dataset, in case you do not have one. My idea is to repeat the current dataset up to a fixed number of samples for each label, e.g., you want 762 samples for every label in your dataset. Here is my approach with some dummy dataset.

import numpy as np
import pandas as pd
from keras.preprocessing.image import ImageDataGenerator
import cv2

cv2.imwrite('temp.png', np.random.randint(0, 255, (3, 3), dtype=np.uint8)) # Create a dummy image to be able to use flow_from_dataframe later

labels = ['a']*10 + ['b']*5 + ['c']*3 # Create some unbalanced dataset

# Create a dataframe
df = pd.DataFrame({'img_path':['./temp.png']*len(labels),'label':labels})

# print(df.head())

def balance_data(df,target_size=12):
    """
    Increase the number of samples to target_size for every label

        Example:
        Current size of the label a: 10
        Target size: 23

        repeat, mod = divmod(target_size,current_size) 
        2, 3 = divmod(23,10)

        Target size: current size * repeat + mod 

    Repeat this example for every label in the dataset.
    """

    df_groups = df.groupby(['label'])
    df_balanced = pd.DataFrame({key:[] for key in df.keys()})

    for i in df_groups.groups.keys():
        df_group = df_groups.get_group(i)
        df_label = df_group.sample(frac=1)
        current_size = len(df_label)

        if current_size >= target_size:
            # If the current size is big enough, keep the label as-is
            df_label_new = df_label
        else:

            # Repeat the current dataset if it is smaller than target_size 
            repeat, mod = divmod(target_size,current_size)
            

            df_label_new = pd.concat([df_label]*repeat,ignore_index=True,axis=0)
            df_label_remainder = df_group.sample(n=mod)

            df_label_new = pd.concat([df_label_new,df_label_remainder],ignore_index=True,axis=0)

            # print(df_label_new)

        df_balanced = pd.concat([df_balanced,df_label_new],ignore_index=True,axis=0)


    return df_balanced

df_balanced = balance_data(df)
# print(df_balanced)

# A particular image will be transformed to its various versions within the augmentation step 
image_datagen = ImageDataGenerator(
    rotation_range=45,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='reflect', cval=125)

image_generator = image_datagen.flow_from_dataframe(
            dataframe=df_balanced,
            x_col="img_path",
            y_col="label",
            class_mode="categorical",
            batch_size=4,
            shuffle=True
            )

# x,y=next(image_generator)
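The core of balance_data is the divmod repeat-and-top-up step. Stripped of pandas, the same idea looks like this (the function name `repeat_to_target` is mine, for illustration only):

```python
import random

def repeat_to_target(items, target):
    # Repeat the whole list `repeat` times, then top up with `mod`
    # randomly sampled items, mirroring balance_data's divmod step.
    repeat, mod = divmod(target, len(items))
    return items * repeat + random.sample(items, mod)

balanced = repeat_to_target(['a.png', 'b.png', 'c.png'], 8)
print(len(balanced))  # 8
```

Every original sample appears at least `repeat` times, so no image is dropped; the remainder is drawn randomly so the top-up is not biased toward the first rows.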

I hope the code is self-explanatory. Let me know if you need further assistance.