Python: How to convert an array of strings to a list of factors

Problem description:

Using Python 2.7 and numpy, I want to create levels in the form of a list of factors.

I have a data file that lists independent variables; the last column indicates the class. For example:

2.34,4.23,0.001, ... ,56.44,2.0,"cloudy with a chance of rain"

Using numpy, I read all the numeric columns into a matrix, and the last column into an array which I call "classes". In fact, I don't know the class names in advance, so I do not want to use a dictionary. I also do not want to use Pandas. Here is an example of the problem:

classes = ['a', 'b', 'c', 'c', 'b', 'a', 'a', 'd']
type(classes)
<type 'list'>
classes = numpy.array(classes)
type(classes)
<type 'numpy.ndarray'>
classes
array(['a', 'b', 'c', 'c', 'b', 'a', 'a', 'd'],
      dtype='|S1')
# requirements call for a list like this:
# [0, 1, 2, 2, 1, 0, 0, 3]

Note that the target classes may be very sparse; for example, a 'z' might appear in perhaps 1 out of 100,000 cases. Also note that the classes may be arbitrary strings of text, for example, scientific names.

I'm using Python 2.7 with numpy, and I'm stuck with my environment. Also, the data has already been preprocessed, so it's scaled and all values are valid. I do not want to preprocess the data a second time to extract the unique classes and build a dictionary before I process the data. What I'm really looking for is the Python equivalent of the stringsAsFactors parameter in R, which automatically converts a string vector to a factor vector when the script reads the data.

Don't ask me why I'm using Python instead of R - I do what I'm told.

Thanks, CC.

You could use np.unique with return_inverse=True to return both the unique class names and a set of corresponding integer indices:

import numpy as np

classes = np.array(['a', 'b', 'c', 'c', 'b', 'a', 'a', 'd'])

classnames, indices = np.unique(classes, return_inverse=True)

print(classnames)
# ['a' 'b' 'c' 'd']

print(indices)
# [0 1 2 2 1 0 0 3]

print(classnames[indices])
# ['a' 'b' 'c' 'c' 'b' 'a' 'a' 'd']
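
To tie this back to the file layout described in the question, here is a minimal sketch of reading the numeric columns into a matrix and factorizing the last column in one pass. The in-memory `sample` text, the column count, and the names `X`, `labels`, and `codes` are assumptions made for illustration; a real script would open the actual file instead.

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for the data file (three numeric columns plus a class label).
sample = io.StringIO(
    u'2.34,4.23,0.001,"cloudy with a chance of rain"\n'
    u'1.10,3.90,0.020,"sunny"\n'
    u'2.80,4.00,0.003,"cloudy with a chance of rain"\n'
)

# csv.reader strips the surrounding quotes from the label column.
rows = list(csv.reader(sample))

# All columns except the last are numeric; the last one is the class label.
X = np.array([row[:-1] for row in rows], dtype=float)
labels = np.array([row[-1] for row in rows])

# No class names need to be known in advance: np.unique discovers them.
classnames, codes = np.unique(labels, return_inverse=True)

# The question asks for a plain list of integer codes.
code_list = codes.tolist()
print(code_list)
```

Because `classnames` is returned alongside the codes, no separate dictionary has to be built, and a rare class simply gets its own integer like any other.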

The class names will be sorted in lexical order.
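
If the codes should instead follow the order in which each class first appears in the data, one possible approach is to also request `return_index=True` and renumber the sorted codes. This is a sketch under that assumption, not something the question asked for; the names `order`, `rank`, `levels`, and `codes` are illustrative.

```python
import numpy as np

classes = np.array(['c', 'a', 'c', 'b', 'a'])

# names: sorted unique labels
# first_idx: position of the first occurrence of each sorted label
# inverse: codes relative to the sorted labels
names, first_idx, inverse = np.unique(
    classes, return_index=True, return_inverse=True
)

# order[i] is the sorted-label position of the i-th label to appear.
order = np.argsort(first_idx)

# rank[k] is the first-appearance code for sorted label k.
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

levels = names[order]   # labels in order of first appearance
codes = rank[inverse]   # integer codes matching `levels`

print(levels)
# ['c' 'a' 'b']
print(codes)
# [0 1 0 2 1]
```

As with the sorted version, `levels[codes]` reproduces the original array.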