Reading a huge .csv file

Problem description:

I'm currently trying to read data from .csv files in Python 2.7, with up to 1 million rows and 200 columns (the files range from 100 MB to 1.6 GB). I can do this (very slowly) for files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

    import csv

    def getdata(filename, criteria):
        data = []
        for criterion in criteria:
            data.append(getstuff(filename, criterion))
        return data

    def getstuff(filename, criterion):
        data = []
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            for row in datareader:
                if row[3] == "column header":
                    data.append(row)        # keep the header row
                elif len(data) < 2 and row[3] != criterion:
                    pass                    # haven't reached the matching block yet
                elif row[3] == criterion:
                    data.append(row)        # collect a matching row
                else:
                    return data             # past the matching block; stop early

The reason for the else clause in the getstuff function is that all the rows which fit the criterion are listed together in the csv file, so I leave the loop once I get past them to save time.
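
For example, with hypothetical grouped data, the first non-matching row after the matching block means nothing later in the file can match either:

    # Hypothetical rows, grouped by the value in column 3.
    rows = [
        ["x", "x", "x", "column header"],
        ["x", "x", "x", "apple"],
        ["x", "x", "x", "banana"],   # the matching block...
        ["x", "x", "x", "banana"],   # ...is contiguous
        ["x", "x", "x", "cherry"],   # first row past the block: safe to stop
    ]
    collected = []
    for row in rows:
        if row[3] == "banana":
            collected.append(row)
        elif collected:              # we were collecting and have now moved past the block
            break
    print(collected)                 # only the two "banana" rows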

My questions are:


  1. How can I manage to get this to work with the bigger files?

  2. Is there any way I can make it faster?

My computer has 8 GB of RAM and runs 64-bit Windows 7; the processor is 3.40 GHz (not certain what information you need).

Any help would be greatly appreciated.

Answer:

You are reading all rows into a list, then processing that list. Don't do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

    import csv

    def getstuff(filename, criterion):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            count = 0
            for row in datareader:
                if row[3] in ("column header", criterion):
                    yield row        # hand each matching row straight to the caller
                    count += 1
                elif count < 2:
                    continue         # haven't reached the matching block yet; keep scanning
                else:
                    return           # past the contiguous block; stop reading the file

I also simplified your filter test; the logic is the same but more concise.
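
A quick throwaway check (with made-up values) shows the two tests agree:

    # Made-up row and criterion, purely to compare the old and new filter tests.
    row = ["a", "b", "c", "column header"]
    criterion = "some value"

    old_test = row[3] == "column header" or row[3] == criterion
    new_test = row[3] in ("column header", criterion)
    assert old_test == new_test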

You can now loop over getstuff() directly. Do the same in getdata():

    def getdata(filename, criteria):
        for criterion in criteria:
            for row in getstuff(filename, criterion):
                yield row            # re-yield each filtered row, one at a time

Now loop directly over getdata() in your code:

    for row in getdata(somefilename, sequence_of_criteria):
        # process row
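
Put together, a complete run could look like this minimal sketch; the file names and criteria values are hypothetical, and it assumes the generator versions of getstuff() and getdata() defined above:

    import csv

    criteria = ["value1", "value2"]               # hypothetical criteria
    matches = 0
    with open("filtered.csv", "wb") as outfile:   # "wb" to match Python 2.7's csv module
        writer = csv.writer(outfile)
        for row in getdata("huge.csv", criteria):
            writer.writerow(row)                  # each row is written out and then discarded
            matches += 1
    print("wrote %d rows" % matches)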

You now only hold one row in memory, instead of your thousands of lines per criterion.

yield makes a function a generator function, which means it won't do any work until you start looping over it.
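
A tiny demonstration of that laziness, using a throwaway generator:

    def numbers():
        print("producing 1")
        yield 1
        print("producing 2")
        yield 2

    gen = numbers()   # nothing is printed yet; this only creates the generator object
    for n in gen:     # each iteration runs the body just far enough to reach the next yield
        print(n)      # prints: producing 1, 1, producing 2, 2 (interleaved)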