Merging columns from multiple CSV files into a single file

Problem description:

I have a bunch of CSV files (only two in the example below). Each CSV file has 6 columns. I want to go into each CSV file, copy the first two columns, and add them as new columns to an existing CSV file.

So far, I have:

import csv

f = open('combined.csv')
data = [item for item in csv.reader(f)]
f.close()

for x in range(1,3): #example has 2 csv files, this will be automated
    n=0
    while n<2:
        f=open(str(x)+".csv")
        new_column=[item[n] for item in csv.reader(f)]
        f.close()
        #print d

        new_data = []

        for i, item in enumerate(data):
            try:
                item.append(new_column[i])
                print i
            except IndexError, e:
                item.append("")
            new_data.append(item)

        f = open('combined.csv', 'w')
        csv.writer(f).writerows(new_data)
        f.close()
        n=n+1

This works. It is not pretty, but it works. However, I have three minor annoyances:

  1. I open each CSV file twice (once for each column), which is hardly elegant.

  2. When I print the combined.csv file, it prints an empty row after each row.

  3. I have to provide a combined.csv file that has at least as many rows as the largest file I may have. Since I do not really know what that number may be, that kinda sucks.

As always, any help is much appreciated!

As requested, 1.csv looks like this (mock data):

1,a
2,b
3,c
4,d

2.csv looks like:

5,e
6,f
7,g
8,h
9,i

The combined.csv file should look like:

1,a,5,e
2,b,6,f
3,c,7,g
4,d,8,h
,,9,i

Answer:

import csv
import itertools as IT

filenames = ['1.csv', '2.csv']
handles = [open(filename, 'rb') for filename in filenames]    
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('combined.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n')
    for rows in IT.izip_longest(*readers, fillvalue=['']*2):
        combined_row = []
        for row in rows:
            row = row[:2] # select the columns you want
            if len(row) == 2:
                combined_row.extend(row)
            else:
                combined_row.extend(['']*2)
        writer.writerow(combined_row)

for f in handles:
    f.close()
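
The code above targets Python 2. For readers on Python 3, the same approach might look like the sketch below. It assumes that itertools.zip_longest (the Python 3 name for izip_longest) is used, that files are opened in text mode with newline='' as the csv module documentation recommends, and that short rows are padded with empty cells rather than replaced wholesale. The function name combine_first_columns and the ncols parameter are made up for illustration.

```python
import csv
from contextlib import ExitStack
from itertools import zip_longest


def combine_first_columns(filenames, out_path, ncols=2):
    """Write the first `ncols` columns of each input file side by side.

    Hypothetical helper: the name and signature are illustrative only.
    """
    blank = [''] * ncols
    with ExitStack() as stack:
        # newline='' lets the csv module handle line endings itself,
        # which avoids the extra blank row after each record on Windows.
        readers = [csv.reader(stack.enter_context(open(name, newline='')))
                   for name in filenames]
        out = stack.enter_context(open(out_path, 'w', newline=''))
        writer = csv.writer(out, lineterminator='\n')
        # zip_longest keeps going until the longest file is exhausted,
        # so no pre-sized combined.csv is needed.
        for rows in zip_longest(*readers, fillvalue=blank):
            combined_row = []
            for row in rows:
                cells = row[:ncols]  # slicing copies, so `blank` is never mutated
                cells += [''] * (ncols - len(cells))  # pad short or empty rows
                combined_row.extend(cells)
            writer.writerow(combined_row)
```

Calling combine_first_columns(['1.csv', '2.csv'], 'combined.csv') opens each input file only once and closes every handle when the with block exits, even if an error occurs partway through.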


The line for rows in IT.izip_longest(*readers, fillvalue=['']*2): can be understood with an example:

In [1]: import itertools as IT

In [2]: readers = [(1,2,3), ('a','b','c','d'), (10,20,30,40)]

In [3]: list(IT.izip_longest(readers[0], readers[1], readers[2]))
Out[3]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

As you can see, IT.izip_longest behaves very much like zip, except that it does not stop until the longest iterable is consumed. It fills in missing items with None by default.
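
The fillvalue argument replaces that default None, which is how the shorter file ends up contributing empty cells instead. A small sketch of the difference, written for Python 3 (where the function is named zip_longest) and using made-up row data:

```python
from itertools import zip_longest  # called izip_longest in Python 2

# Made-up rows standing in for two CSV files of different lengths.
shorter = [['1', 'a'], ['2', 'b']]
longer = [['5', 'e'], ['6', 'f'], ['7', 'g']]

default_fill = list(zip_longest(shorter, longer))
custom_fill = list(zip_longest(shorter, longer, fillvalue=['', '']))

print(default_fill[-1])  # (None, ['7', 'g'])
print(custom_fill[-1])   # (['', ''], ['7', 'g'])
```

With the custom fill value, every element of the tuple is a list of two strings, so the loop body can treat a missing row exactly like a real one.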

Now what would happen if there were more than 3 items in readers? We would want to write

list(IT.izip_longest(readers[0], readers[1], readers[2], ...))

but that is laborious, and if we did not know len(readers) in advance, we would not even be able to replace the ellipsis (...) with something explicit.

Python has a solution for this: the star (aka argument unpacking) syntax:

In [4]: list(IT.izip_longest(*readers))
Out[4]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

Notice that the result Out[4] is identical to the result Out[3].

The *readers tells Python to unpack the items in readers and send them along as individual arguments to IT.izip_longest. This is how Python allows us to send an arbitrary number of arguments to a function.
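
To see the unpacking in isolation, here is a minimal sketch with a throwaway function, count_args, that is not part of any library and simply reports how many positional arguments it received. It uses the Python 3 name zip_longest for the final comparison.

```python
from itertools import zip_longest  # called izip_longest in Python 2


def count_args(*args):
    """Hypothetical helper: return the number of positional arguments."""
    return len(args)


readers = [(1, 2, 3), ('a', 'b', 'c', 'd'), (10, 20, 30, 40)]

without_star = count_args(readers)   # the whole list arrives as one argument
with_star = count_args(*readers)     # each item arrives as its own argument

print(without_star)  # 1
print(with_star)     # 3

# The starred call is equivalent to spelling out every item by hand.
explicit = list(zip_longest(readers[0], readers[1], readers[2]))
unpacked = list(zip_longest(*readers))
print(explicit == unpacked)  # True
```

Because the star form works for a list of any length, the same line of code handles two CSV files or two hundred.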