Can we make an ML model (pickle file) more robust by accepting (or ignoring) new features?

Problem description:

  • I have trained an ML model and stored it in a pickle file.
  • In my new script, I am reading new "real-world data", on which I want to make predictions.

However, I am struggling. I have a column (containing string values), like:

Sex       
Male       
Female
# This is just an example; the real column has many more unique values

Now comes the issue. I received a new (unique) value, and now I cannot make predictions anymore (e.g. 'Neutral' was added).

Since I am transforming the 'Sex' column into dummies, my model no longer accepts the input:

Number of features of the model must match the input. Model n_features is 2 and input n_features is 3

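Hence my question: is there a way to make my model more robust so that it simply ignores this class, and still makes a prediction without that specific information?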

What I have tried:

import pickle

import pandas as pd

df = pd.read_csv('dataset_that_i_want_to_predict.csv')
with open("model_trained.sav", 'rb') as f:
    model = pickle.load(f)

# I have an 'example_df' containing just 1 row of training data (this is exactly what the model needs)
example_df = pd.read_csv('reading_one_row_of_trainings_data.csv')

# Checking for missing columns, and adding that to the new dataset 
missing_cols = set(example_df.columns) - set(df.columns)
for column in missing_cols:
    df[column] = 0  # add the missing column filled with 0 (fine, since every column is a dummy)

# make sure that we have the same order 
df = df[example_df.columns] 

# The prediction will lead to an error!
results = model.predict(df)

# ValueError: Number of features of the model must match the input. Model n_features is X and n_features is Y

Note: I searched, but could not find any helpful solution (not here, here, or here).

Update

I also found this article, but it has the same issue: we can give the test set the same columns as the training set, but what about new real-world data (e.g. the new value 'Neutral')?

Yes, once training is done you cannot add a new category or feature to the model without retraining it. However, OneHotEncoder can handle new categories that show up in a feature at prediction time. With handle_unknown='ignore', it keeps the encoded columns consistent between your training and test data, and a category it has not seen during fit is encoded as all zeros in that feature's dummy columns.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True)
df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(np.random.choice(['yes', 'no'], (20,)))

model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                       remainder='passthrough')),
                  ('lr', LogisticRegression())])

model.fit(df, target)

# let us introduce new categories in feature_2 in test data
test_df = pd.DataFrame({'feature_1': np.random.rand(20),
                        'feature_2': np.random.choice(['male', 'female', 'neutral', 'unknown'], (20,))})
model.predict(test_df)
# array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes'], dtype=object)
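
Since the question stores the model in a pickle file, it may help to pickle the whole pipeline rather than only the estimator, so the fitted encoder travels with the model. The following is a minimal sketch of that idea, not a definitive implementation: the 'Age' column, the toy values, and the names train_df, new_df, and restored are assumptions made up for illustration, and an in-memory pickle round trip stands in for writing and reading a file.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (hypothetical values, for illustration only).
train_df = pd.DataFrame({
    "Age": [22, 35, 47, 51, 29, 63],
    "Sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
})
train_target = pd.Series(["no", "yes", "yes", "no", "no", "yes"])

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), ["Sex"])],
        remainder="passthrough",
    )),
    ("lr", LogisticRegression()),
])
pipeline.fit(train_df, train_target)

# Round-trip through pickle: the fitted encoder is stored together with
# the classifier, so the prediction script needs no separate dummy step.
restored = pickle.loads(pickle.dumps(pipeline))

# New real-world data containing a category that was never seen in training.
new_df = pd.DataFrame({"Age": [40, 33], "Sex": ["Neutral", "Male"]})
predictions = restored.predict(new_df)  # no ValueError is raised

# Inspect how the fitted encoder treats the unseen value.
encoder = restored.named_steps["preprocess"].named_transformers_["ohe"]
encoded = encoder.transform(new_df[["Sex"]]).toarray()
print(encoder.categories_[0])  # categories learned during fit
print(encoded)                 # the 'Neutral' row is all zeros
```

Note that the model then predicts for 'Neutral' as if no category information were present at all. Whether that is acceptable depends on the use case; if the new value turns out to matter, retraining with data that contains it is the only way for the model to learn from it.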