pandas distinction between str and object types

Question:

NumPy seems to make a distinction between str and object types. For instance, I can do::

>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('S')
>>> np.dtype(object)
dtype('O')

where dtype('S') and dtype('O') correspond to str and object, respectively.

However, pandas seems to lack that distinction and coerces str to object::

>>> df = pd.DataFrame({'a': np.arange(5)})
>>> df.a.dtype
dtype('int64')
>>> df.a.astype(str).dtype
dtype('O')
>>> df.a.astype(object).dtype
dtype('O')

Forcing the type to dtype('S') does not help either::

>>> df.a.astype(np.dtype(str)).dtype
dtype('O')
>>> df.a.astype(np.dtype('S')).dtype
dtype('O')

Is there any explanation for this behavior?

Answer:

NumPy's string dtypes aren't Python strings.

Therefore, pandas deliberately uses native Python strings, which require an object dtype.

First off, let me demonstrate a bit of what I mean by NumPy's strings being different:

In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, x is a NumPy string dtype (fixed-width, C-like strings) and y is an array of native Python strings.

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype version will be truncated:

In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
      dtype='|S7')

The object dtype version, by contrast, can hold strings of arbitrary length:

In [6]: y[1] = 'a really really really long'

In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)

Next, the |S dtype strings can't hold unicode properly, although there is also a fixed-length unicode string dtype. I'll skip an example for the moment.
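For anyone who wants to see the unicode limitation anyway, here is a minimal sketch. It assumes Python 3 and a reasonably recent NumPy, where str input to an |S array is encoded as ASCII; the sample words are made up for illustration.

```python
import numpy as np

# '|S' is a byte-string dtype. On Python 3, NumPy encodes str input as ASCII,
# so non-ASCII text is rejected rather than stored.
bytes_ok = True
try:
    np.array(['café'], dtype='S5')
except UnicodeEncodeError:
    bytes_ok = False

# The unicode dtype ('U') can hold the text, but it is still fixed-width.
u = np.array(['café'], dtype='U4')
u[0] = 'cafétéria'  # silently truncated to 4 characters

print(bytes_ok)          # whether the |S array accepted the non-ASCII string
print(u[0])              # the truncated value
print(u.dtype.itemsize)  # bytes per element (4 bytes per character here)
```

So the U dtype gets around the encoding problem but keeps the truncation behavior shown earlier.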

Finally, NumPy's strings are actually mutable, while Python strings are not. For example:

In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
      dtype='|S7')
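The same contrast can be sketched as a standalone script. This assumes Python 3, where |S elements come back as bytes; the variable names are only illustrative.

```python
import numpy as np

# A fixed-width byte-string array: the characters live in the array's buffer.
x = np.array([b'Testing'], dtype='S7')

# Viewing the same buffer as uint8 lets us change the characters in place.
z = x.view(np.uint8)
z += 1
print(x[0])  # every byte was shifted by one

# A native Python string does not allow in-place modification.
py = 'Testing'
immutable = False
try:
    py[0] = 'U'
except TypeError:
    immutable = True
print(immutable, py)
```

The view shares memory with x, which is why changing z changes the string; a Python str offers no such back door.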

For all of these reasons, pandas chose never to allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width NumPy string won't work in pandas. Instead, it always uses native Python strings, which behave more intuitively for most users.
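To see what that means on the pandas side, here is a rough sketch. It passes dtype=object explicitly, on the assumption that default string handling may differ between pandas versions; the data is the same toy example as above.

```python
import pandas as pd

# Ask for object dtype explicitly so the result does not depend on the
# pandas version's default handling of string data.
s = pd.Series(['Testing', 'a', 'string'], dtype=object)

# Each element is an ordinary Python str, so nothing is truncated.
s.loc[1] = 'a really really really long'

print(s.dtype)          # the Series dtype
print(type(s.iloc[0]))  # the type of a single element
print(s.loc[1])         # the full string, untruncated
```

Because the Series only stores references to Python str objects, assigning a longer string just swaps the reference instead of writing into a fixed-size slot.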
