How do I create an interaction design matrix from categorical variables?

Question:

I come mainly from working in R for statistical modeling and machine learning, and I am looking to improve my Python skills. What is the best way to create a design matrix of categorical interactions (to arbitrary degree) in Python?

A toy example:

import pandas as pd

# pandas can read a CSV directly from a URL
# (the original "from urllib import urlopen" only works on Python 2)
df = pd.read_csv("http://www.shatterline.com/MachineLearning/data/tennis_anyone.csv")
df.head(n=5)

Let's say we want to create interactions between Outlook, Temperature and Humidity. Is there an efficient way to do this? I can manually do something like this in pandas:

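# pd.lib.fast_zip was a private helper that newer pandas versions no longer expose;
# joining the two columns as strings before factorizing yields one integer code per distinct pair
OutTempFact = pd.Series(pd.factorize(df.Outlook.astype(str) + "_" + df.Temperature.astype(str))[0], name='OutTemp')
OutHumFact = pd.Series(pd.factorize(df.Outlook.astype(str) + "_" + df.Humidity.astype(str))[0], name='OutHum')
TempHumFact = pd.Series(pd.factorize(df.Temperature.astype(str) + "_" + df.Humidity.astype(str))[0], name='TempHum')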

IntFacts=pd.concat([OutTempFact,OutHumFact,TempHumFact],axis=1)
IntFacts.head(n=5)

I could then pass this to a scikit-learn one-hot encoder, as shown next, but there is likely a much better, less manual way to create interactions between categorical variables without having to step through each combination.

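# import the encoder explicitly; "import sklearn" alone does not load the preprocessing submodule
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
IntFacts_OH = enc.fit_transform(IntFacts)
IntFacts_OH.todense()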

Answer:

If you use the OneHotEncoder on your design matrix to obtain a one-hot design matrix, then interactions are nothing other than multiplications between columns. If X_1hot is your one-hot design matrix as a dense array, where samples are rows, then for 2nd order interactions you can write

X_2nd_order = (X_1hot[:, np.newaxis, :] * X_1hot[:, :, np.newaxis]).reshape(len(X_1hot), -1)

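The result contains every interaction twice (once as column i times column j and once as column j times column i), and it contains the original features as well, since a 0/1 indicator column multiplied by itself is unchanged.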
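If the duplicates and the original features are unwanted, one option is to keep only the column pairs with i < j. The following is a minimal sketch rather than part of the original answer; it assumes a dense NumPy array, and the helper name pairwise_interactions and the toy matrix are made up for illustration.

```python
import numpy as np

def pairwise_interactions(X_1hot):
    """Return products of all distinct column pairs (i < j) of a dense 0/1 matrix."""
    n_columns = X_1hot.shape[1]
    # Strict upper triangle: every unordered pair exactly once, no diagonal
    # (the diagonal would only reproduce the original indicator columns).
    i, j = np.triu_indices(n_columns, k=1)
    return X_1hot[:, i] * X_1hot[:, j], i, j

# Hypothetical one-hot matrix: two variables with two states each, three samples.
X_toy = np.array([[1., 0., 1., 0.],
                  [1., 0., 0., 1.],
                  [0., 1., 0., 1.]])

X_pairs, i, j = pairwise_interactions(X_toy)
print(X_pairs.shape)  # (3, 6): 4 columns give 4 * 3 / 2 = 6 distinct pairs
```

Pairs formed from two states of the same variable are still included here; those columns are always zero and can be dropped afterwards.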

Going to arbitrary order is going to make your design matrix explode in size. If you really want to do that, then you should look into kernelizing with a polynomial kernel, which will let you go to arbitrary degrees easily.
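To illustrate the kernel idea: a polynomial kernel gives the inner products between samples in the interaction feature space without ever building that space. The sketch below assumes the common form (x . y + c)^d; the function name polynomial_kernel_matrix and the toy data are illustrative, not from the original answer.

```python
import numpy as np

def polynomial_kernel_matrix(X, degree=2, coef0=1.0):
    """Gram matrix K[a, b] = (X[a] . X[b] + coef0) ** degree."""
    return (X @ X.T + coef0) ** degree

# Hypothetical one-hot matrix: two variables with two states each, three samples.
X_toy = np.array([[1., 0., 1., 0.],
                  [1., 0., 0., 1.],
                  [0., 1., 0., 1.]])

K = polynomial_kernel_matrix(X_toy, degree=2, coef0=0.0)

# With coef0 = 0 and degree = 2, the kernel equals the inner product of the
# explicit second-order features, so the explicit matrix is never needed.
X_2nd = (X_toy[:, np.newaxis, :] * X_toy[:, :, np.newaxis]).reshape(len(X_toy), -1)
print(np.allclose(K, X_2nd @ X_2nd.T))  # True
```

The Gram matrix has one row and one column per sample regardless of the degree, which is why this scales to high orders. It would then be passed to a method that accepts a precomputed kernel.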

Using the data frame you present, we can proceed as follows. First, a manual way to construct a one-hot design out of the data frame:

import numpy as np
indicators = []
state_names = []
for column_name in df.columns:
    column = df[column_name].values
    one_hot = (column[:, np.newaxis] == np.unique(column)).astype(float)
    indicators.append(one_hot)
    state_names = state_names + ["%s__%s" % (column_name, state) for state in np.unique(column)]

X_1hot = np.hstack(indicators)
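As a possible shortcut for the loop above, pandas.get_dummies can build the indicator columns and their names in one call. This is a sketch under assumptions: the small data frame below is a made-up stand-in for the tennis data, and by default get_dummies only encodes string and categorical columns, so it matches the manual loop only when every column is of that kind.

```python
import pandas as pd

# Hypothetical stand-in for the tennis data frame (values are made up).
df_toy = pd.DataFrame({
    "Outlook": ["Sunny", "Rain", "Sunny"],
    "Humidity": ["High", "Normal", "Normal"],
})

# One indicator column per (variable, state), named "<variable>__<state>".
dummies = pd.get_dummies(df_toy, prefix_sep="__", dtype=float)
X_1hot_toy = dummies.to_numpy()
state_names_toy = list(dummies.columns)
print(state_names_toy)
```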

The column names are then stored in state_names and the indicator matrix is X_1hot. Then we calculate the second-order features

X_2nd_order = (X_1hot[:, np.newaxis, :] * X_1hot[:, :, np.newaxis]).reshape(len(X_1hot), -1)

In order to know the names of the columns of the second-order matrix, we construct them like this

from itertools import product
one_hot_interaction_names = ["%s___%s" % (column1, column2) 
                             for column1, column2 in product(state_names, state_names)]
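Since the question asks about arbitrary degree, the same idea can be extended by multiplying one indicator column from each of k different variables. The following is a rough sketch, not part of the original answer. It assumes the per-variable indicator blocks are kept in a list (like the indicators list above) along with a list of state names per variable. The function name interactions and the toy inputs are hypothetical.

```python
from itertools import combinations, product
import numpy as np

def interactions(blocks, names, degree):
    """Multiply one indicator column from each of `degree` distinct variables.

    blocks: list of (n_samples, n_states) 0/1 arrays, one per variable.
    names:  list of lists with the state names of each variable.
    """
    columns, column_names = [], []
    for variables in combinations(range(len(blocks)), degree):
        # Every way of picking one state per chosen variable.
        state_ranges = [range(blocks[v].shape[1]) for v in variables]
        for states in product(*state_ranges):
            picked = [blocks[v][:, s] for v, s in zip(variables, states)]
            columns.append(np.prod(picked, axis=0))
            column_names.append("___".join(names[v][s] for v, s in zip(variables, states)))
    return np.column_stack(columns), column_names

# Hypothetical data: three variables with two states each, three samples.
blocks_toy = [np.array([[1., 0.], [0., 1.], [1., 0.]]),
              np.array([[0., 1.], [0., 1.], [1., 0.]]),
              np.array([[1., 0.], [1., 0.], [0., 1.]])]
names_toy = [["A__a1", "A__a2"], ["B__b1", "B__b2"], ["C__c1", "C__c2"]]

X_3rd, names_3rd = interactions(blocks_toy, names_toy, degree=3)
print(X_3rd.shape)  # (3, 8): 2 * 2 * 2 state combinations
```

Because each product uses distinct variables, this version produces neither duplicates nor the always-zero products of two states of the same variable. The number of columns still grows quickly with the degree.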
