pandas read_csv and filtering columns with usecols

Problem description:

I have a csv file that is not read in correctly by pandas.read_csv when I filter the columns with usecols and use multiple indexes.

import pandas as pd
csv = r"""dummy,date,loc,x
   bar,20090101,a,1
   bar,20090102,a,3
   bar,20090103,a,5
   bar,20090101,b,1
   bar,20090102,b,3
   bar,20090103,b,5"""

# Write the sample data to disk; the with-block closes the file for us
with open('foo.csv', 'w') as f:
    f.write(csv)

df1 = pd.read_csv('foo.csv',
        header=0,
        names=["dummy", "date", "loc", "x"], 
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"])
print(df1)

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv', 
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"], # <----------- Changed
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
print(df2)

I expect df1 and df2 to be the same except for the missing dummy column, but in df2 the columns come in mislabeled. Also, in df2 the date is not getting parsed as a date.

In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.
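For reference, the workaround mentioned above (read everything, then drop the dummy column) could look roughly like the sketch below. It assumes a recent pandas where DataFrame.drop accepts a columns keyword, which is not the case for 0.10.1, and it reads from an in-memory string (the name csv_text is only for this illustration) rather than foo.csv.

```python
import pandas as pd
from io import StringIO

# Same sample data as above, held in memory (csv_text is an illustrative name).
csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

# Read every column first, including the unwanted "dummy" column.
df_all = pd.read_csv(
    StringIO(csv_text),
    header=0,
    index_col=["date", "loc"],
    parse_dates=["date"],
)

# Then drop the unwanted column after the fact.
# Assumption: a pandas version whose drop() supports the columns keyword.
df_workaround = df_all.drop(columns=["dummy"])
print(df_workaround)
```

This gets the right frame, but the dummy column is still parsed and held in memory before it is discarded.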

edit: fixed bad header usage.

Answer

The solution lies in understanding these two keyword arguments:

names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices.
usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.
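To illustrate the case where names is actually needed, here is a minimal sketch that assumes the same data but with the header row removed (that headerless variant is made up for this example and is not the file from the question). It was written with a recent pandas in mind, not 0.10.1.

```python
import pandas as pd
from io import StringIO

# Hypothetical variant of the question's data WITHOUT a header row.
headerless_csv = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df_named = pd.read_csv(
    StringIO(headerless_csv),
    header=None,                          # the file has no header row
    names=["dummy", "date", "loc", "x"],  # so we supply the labels ourselves
    usecols=["date", "loc", "x"],         # and can then filter by those labels
    index_col=["date", "loc"],
    parse_dates=["date"],
)
print(df_named)
```

Here names labels all four columns in the file, and usecols then selects three of them by label, so the dummy column is never loaded into the DataFrame.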

So because you have a header row, passing header=0 is sufficient, and additionally passing names appears to confuse pd.read_csv.

Removing names from the second call gives the desired output:

import pandas as pd
from io import StringIO  # Python 3 location of StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        header=0,
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"],
        parse_dates=["date"])

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
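Since the question also mentions selecting columns by number, here is a small sketch comparing three ways of expressing the same filter: by name, by integer position, and with a callable. It assumes a recent pandas (callable usecols is not available in 0.10.1), and the variable names are only illustrative.

```python
import pandas as pd
from io import StringIO

csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

# Arguments shared by all three calls; note that names is NOT passed.
common = dict(header=0, index_col=["date", "loc"], parse_dates=["date"])

# Filter by column label.
by_name = pd.read_csv(StringIO(csv_text), usecols=["date", "loc", "x"], **common)

# Filter by zero-based position in the file (column 0 is "dummy").
by_position = pd.read_csv(StringIO(csv_text), usecols=[1, 2, 3], **common)

# Filter with a callable that is evaluated against each column label.
by_callable = pd.read_csv(
    StringIO(csv_text), usecols=lambda name: name != "dummy", **common
)

# Report whether the three approaches produce identical frames.
print(by_name.equals(by_position))
print(by_name.equals(by_callable))
```

The callable form can be convenient when it is easier to say which columns to exclude than which to keep.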
