pandas read_csv and filter columns with usecols

Question:

I have a csv file which isn't coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.

import pandas as pd

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

# Write the sample data to disk so read_csv can load it by path
with open('foo.csv', 'w') as f:
    f.write(csv)

df1 = pd.read_csv('foo.csv',
        header=0,
        names=["dummy", "date", "loc", "x"], 
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"])
print(df1)

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv', 
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"], # <----------- Changed
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
print(df2)

I expect df1 and df2 to be the same except for the missing dummy column, but in df2 the columns come in mislabeled. Also, the date is no longer being parsed as a date.

In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.
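
For reference, the workaround mentioned above might look like the following sketch. It assumes a recent pandas release rather than 0.10.1, reads the data from an in-memory string instead of foo.csv, and uses the illustrative names `full` and `trimmed`, which are not part of the original script.

```python
from io import StringIO

import pandas as pd

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090101,b,1
bar,20090102,b,3"""

# Read every column first, building the (date, loc) index as usual
full = pd.read_csv(StringIO(csv), index_col=["date", "loc"], parse_dates=["date"])

# Then discard the unwanted column after the fact
trimmed = full.drop(columns=["dummy"])
print(trimmed)
```

This loads the dummy column into memory before throwing it away, which is what filtering at read time is meant to avoid.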

Edit: fixed bad header usage.

Answer:

The answer by @chip completely misses the point of two keyword arguments.

names is only necessary when there is no header and you want to specify other arguments using column names rather than integer indices.
usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.
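
To illustrate the case where names is actually needed, here is a minimal sketch that assumes a hypothetical headerless copy of the sample data (the `raw` string below is made up for illustration) and a recent pandas release. With header=None there is no header row to supply labels, so names provides them, and usecols, index_col and parse_dates can then refer to columns by name.

```python
from io import StringIO

import pandas as pd

# Hypothetical headerless version of the sample data (no "dummy,date,loc,x" row)
raw = """bar,20090101,a,1
bar,20090102,a,3
bar,20090101,b,1
"""

df = pd.read_csv(
    StringIO(raw),
    header=None,                          # the file has no header row
    names=["dummy", "date", "loc", "x"],  # so labels must be supplied here
    usecols=["date", "loc", "x"],         # filter out "dummy" while reading
    index_col=["date", "loc"],
    parse_dates=["date"],
)
print(df)
```

When the file already has a header row, as in the question, names can simply be left out.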

This solution corrects those oddities:

import pandas as pd
from io import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        header=0,
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"],
        parse_dates=["date"])

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
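
The question also mentions selecting columns by number. As a rough sketch, assuming a recent pandas release, the same filter can be written with zero-based positions in usecols while still naming the index columns. The variable names `by_name` and `by_position` are illustrative only.

```python
from io import StringIO

import pandas as pd

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090101,b,1
bar,20090102,b,3"""

# Select columns by name, as in the code above
by_name = pd.read_csv(StringIO(csv), header=0, usecols=["date", "loc", "x"],
                      index_col=["date", "loc"], parse_dates=["date"])

# Select the same columns by zero-based position in the file (0 is "dummy")
by_position = pd.read_csv(StringIO(csv), header=0, usecols=[1, 2, 3],
                          index_col=["date", "loc"], parse_dates=["date"])

# Compare the two results
print(by_name.equals(by_position))
```

Index columns are given by name here to avoid any ambiguity about whether integer positions count columns before or after the usecols filter.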
