How to read a large tsv file in Python and convert it to csv
I have a large tsv file (around 12 GB) that I want to convert to a csv file. For smaller tsv files, I use the following code, which works but is slow:
import pandas as pd
table = pd.read_table(path_of_tsv_file, sep='\t')
table.to_csv(path_and_name_of_csv_file, index=False)
However, this code does not work for my large file, and the kernel resets in the middle.
Is there any way to fix the problem? Does anyone know if the task is doable with Dask instead of Pandas?
I am using Windows 10.
Instead of loading all lines into memory at once, you can read the file line by line and process them one after another.

Using Python 3.x:
fs = ","  # output field separator
table = str.maketrans('\t', fs)  # translation table: tab -> comma
fName = 'hrdata.tsv'
f = open(fName, 'r')
try:
    line = f.readline()
    while line:
        # each line keeps its trailing newline, so suppress print's own
        print(line.translate(table), end="")
        line = f.readline()
except IOError:
    print("Could not read file: " + fName)
finally:
    f.close()
Input (hrdata.tsv):
Name Hire Date Salary Sick Days remaining
Graham Chapman 03/15/14 50000.00 10
John Cleese 06/01/15 65000.00 8
Eric Idle 05/12/14 45000.00 10
Terry Jones 11/01/13 70000.00 3
Terry Gilliam 08/12/14 48000.00 7
Michael Palin 05/23/13 66000.00 8
Output:
Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8
Command:
python tsv_csv_convertor.py > new_csv_file.csv
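For a 12 GB file, reading and translating large binary blocks is noticeably faster than per-line processing, and it avoids shell redirection entirely (handy on Windows). A minimal sketch with hypothetical file names and a tiny generated sample standing in for the real file; like the script above, it assumes fields contain no commas, quotes, or embedded tabs that would need CSV quoting:

```python
# Create a tiny sample input standing in for the real 12 GB file.
with open("hrdata_sample.tsv", "wb") as f:
    f.write(b"Name\tSalary\nGraham Chapman\t50000.00\n")

TAB_TO_COMMA = bytes.maketrans(b"\t", b",")
BLOCK = 1 << 20  # read 1 MiB at a time; memory use stays constant

# Stream the input in fixed-size blocks, translating tabs to commas.
with open("hrdata_sample.tsv", "rb") as src, \
     open("hrdata_sample.csv", "wb") as dst:
    while True:
        block = src.read(BLOCK)
        if not block:
            break
        dst.write(block.translate(TAB_TO_COMMA))
```

Because `bytes.translate` is a single C-level pass over each block, the loop runs a handful of Python statements per megabyte instead of per line.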
Note:
If you use a Unix environment, just run the command:
tr '\t' ',' <input.tsv >output.csv
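As for the Pandas/Dask part of the question: Pandas can handle this without loading the whole file, via `read_csv`'s `chunksize` parameter, which returns an iterator of DataFrames (Dask's `dask.dataframe.read_csv(..., sep='\t')` followed by `.to_csv(...)` works along the same lines). A minimal sketch with hypothetical file names and a tiny generated sample standing in for the real file:

```python
import pandas as pd

tsv_path, csv_path = "hrdata_big.tsv", "hrdata_big.csv"  # hypothetical names

# Tiny sample input standing in for the real 12 GB file.
with open(tsv_path, "w", newline="") as f:
    f.write("Name\tSalary\n"
            "Graham Chapman\t50000.00\n"
            "John Cleese\t65000.00\n"
            "Eric Idle\t45000.00\n")

# chunksize makes read_csv yield DataFrames of at most that many rows,
# so memory use is bounded regardless of file size. dtype=str keeps
# values as text, so numbers like 50000.00 round-trip unchanged.
reader = pd.read_csv(tsv_path, sep="\t", chunksize=2, dtype=str)
for i, chunk in enumerate(reader):
    # Write the header only for the first chunk, then append.
    chunk.to_csv(csv_path, mode="w" if i == 0 else "a",
                 header=(i == 0), index=False)
```

In real use you would pick a much larger `chunksize` (say, a few hundred thousand rows); the tiny value here just exercises the append path.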