Pandas distinction between str and object types

Question:

NumPy seems to make a distinction between the str and object types. For instance, I can do ::

>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('S')
>>> np.dtype(object)
dtype('O')

Here dtype('S') and dtype('O') correspond to str and object, respectively.

However, pandas seems to lack that distinction and coerces str to object. ::

>>> df = pd.DataFrame({'a': np.arange(5)})
>>> df.a.dtype
dtype('int64')
>>> df.a.astype(str).dtype
dtype('O')
>>> df.a.astype(object).dtype
dtype('O')

Forcing the type to dtype('S') does not help either. ::

>>> df.a.astype(np.dtype(str)).dtype
dtype('O')
>>> df.a.astype(np.dtype('S')).dtype
dtype('O')

Is there any explanation for this behavior?

NumPy's string dtypes aren't Python strings.

Therefore, pandas deliberately uses native Python strings, which require an object dtype.
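
A quick way to check this claim is to look at what a converted column actually holds. The snippet below is only a sketch that reuses the DataFrame from the question; the names `as_text` and `first` are made up for illustration. The reported dtype is object on the pandas versions discussed here, and newer releases that ship a dedicated string dtype may print something else, so the check that matters is the type of the element itself.

```python
import numpy as np
import pandas as pd

# Same starting point as the question: an integer column.
df = pd.DataFrame({'a': np.arange(5)})

# Convert the column to strings and inspect what is actually stored.
as_text = df['a'].astype(str)
first = as_text.iloc[0]

print(as_text.dtype)   # reported dtype depends on the pandas version
print(type(first))     # each element is a native Python str
```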

First off, let me demonstrate a bit of what I mean by numpy's strings being different:

In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, x is an array with a NumPy string dtype (fixed-width, C-like strings) and y is an array of native Python strings.

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:

In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
      dtype='|S7')

The object dtype version, by contrast, can hold strings of arbitrary length:

In [6]: y[1] = 'a really really really long'

In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)
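
The session above was recorded on Python 2. As a rough sketch of the same comparison as a standalone Python 3 script (the variable `long_text` is introduced here just for illustration), note that the 'S' dtype holds bytes on Python 3, so its elements print with a `b` prefix:

```python
import numpy as np

# Fixed-width byte strings: every element gets exactly 7 bytes.
x = np.array(['Testing', 'a', 'string'], dtype='S7')
# Object array: every element is a reference to a Python str.
y = np.array(['Testing', 'a', 'string'], dtype=object)

long_text = 'a really really really long'
x[1] = long_text   # silently cut down to the first 7 bytes
y[1] = long_text   # stored in full

print(x.dtype.itemsize)   # 7
print(x[1])               # b'a reall'
print(y[1])               # a really really really long
```

The truncation happens without any warning, which is one of the surprises the object dtype avoids.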

Next, the |S dtype strings can't hold Unicode properly, although there is also a fixed-length Unicode string dtype. I'll skip an example for the moment.
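
For readers who want one anyway, here is a minimal sketch of the Unicode point, assuming Python 3 and a reasonably recent NumPy. The sample text and the `bytes_ok` flag are made up for illustration. On Python 3, NumPy encodes str input as ASCII when filling an 'S' array, so non-ASCII text is rejected outright, while the 'U' dtype accepts it but is still fixed-width:

```python
import numpy as np

text = 'na\u00efve'   # "naive" with a diaeresis: one non-ASCII character

# Bytes dtype ('S'): str input is encoded as ASCII, so this fails.
try:
    np.array([text], dtype='S5')
    bytes_ok = True
except UnicodeEncodeError:
    bytes_ok = False

# Unicode dtype ('U'): stores the text, but still at a fixed width.
u = np.array([text], dtype='U5')
u[0] = 'na\u00efvet\u00e9'   # 7 characters, truncated to the first 5

print(bytes_ok)            # False
print(len(u[0]))           # 5
print(u.dtype.itemsize)    # 20, i.e. 4 bytes per character
```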

Finally, numpy's strings are actually mutable, while Python strings are not. For example:

In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
      dtype='|S7')
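
The same contrast can be sketched as a standalone Python 3 script, shown here next to the immutability of a Python str. The flag `str_mutable` is introduced only for illustration, and the array is shortened to two elements to keep the output small:

```python
import numpy as np

# A Python str cannot be changed in place.
s = 'Testing'
try:
    s[0] = 'U'
    str_mutable = True
except TypeError:
    str_mutable = False

# A fixed-width NumPy string array is just a buffer of bytes, so a view
# with an integer dtype can rewrite its contents.
x = np.array(['Testing', 'string'], dtype='S7')
z = x.view(np.uint8)   # same memory, seen as 14 individual bytes
z += 1                 # shift every byte by one, in place

print(str_mutable)     # False
print(x[0])            # b'Uftujoh'
print(x[1])            # b'tusjoh\x01'
```

The trailing `\x01` in the second element is the padding byte of the 7-byte slot, which was a null before the increment.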

For all of these reasons, pandas chose never to allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width NumPy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.