How to read a large tsv file in Python and convert it to csv
I have a large tsv file (around 12 GB) that I want to convert to a csv file. For smaller tsv files, I use the following code, which works but is slow:
import pandas as pd
table = pd.read_table(path_of_tsv_file, sep='\t')
table.to_csv(path_and_name_of_csv_file, index=False)
However, this code does not work for my large file, and the kernel resets in the middle.
Is there any way to fix the problem? Does anyone know if the task is doable with Dask instead of Pandas?
I am using Windows 10.
Instead of loading all lines into memory at once, you can read the file line by line and process them one after another.

Using Python 3.x:
fs = ","  # output field separator
table = str.maketrans('\t', fs)  # translation table: tab -> comma
fName = 'hrdata.tsv'
f = open(fName, 'r')
try:
    line = f.readline()
    while line:
        # each line keeps its trailing newline, so suppress print's own
        print(line.translate(table), end="")
        line = f.readline()
except IOError:
    print("Could not read file: " + fName)
finally:
    f.close()
Input (hrdata.tsv):
Name Hire Date Salary Sick Days remaining
Graham Chapman 03/15/14 50000.00 10
John Cleese 06/01/15 65000.00 8
Eric Idle 05/12/14 45000.00 10
Terry Jones 11/01/13 70000.00 3
Terry Gilliam 08/12/14 48000.00 7
Michael Palin 05/23/13 66000.00 8
Output:
Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8
Command:
python tsv_csv_convertor.py > new_csv_file.csv
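For a 12 GB file, reading and translating large binary blocks is noticeably faster than per-line processing, and it avoids shell redirection entirely (handy on Windows). A minimal sketch with hypothetical file names and a tiny generated sample standing in for the real file; like the script above, it assumes fields contain no commas, quotes, or embedded tabs that would need CSV quoting:

```python
# Create a tiny sample input standing in for the real 12 GB file.
with open("hrdata_sample.tsv", "wb") as f:
    f.write(b"Name\tSalary\nGraham Chapman\t50000.00\n")

TAB_TO_COMMA = bytes.maketrans(b"\t", b",")
BLOCK = 1 << 20  # read 1 MiB at a time; memory use stays constant

# Stream the input in fixed-size blocks, translating tabs to commas.
with open("hrdata_sample.tsv", "rb") as src, \
     open("hrdata_sample.csv", "wb") as dst:
    while True:
        block = src.read(BLOCK)
        if not block:
            break
        dst.write(block.translate(TAB_TO_COMMA))
```

Because `bytes.translate` is a single C-level pass over each block, the loop runs a handful of Python statements per megabyte instead of per line.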
Note:
If you use a Unix environment, just run the command:
tr '\t' ',' <input.tsv >output.csv
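As for the Pandas/Dask part of the question: Pandas can handle this without loading the whole file, via `read_csv`'s `chunksize` parameter, which returns an iterator of DataFrames (Dask's `dask.dataframe.read_csv(..., sep='\t')` followed by `.to_csv(...)` works along the same lines). A minimal sketch with hypothetical file names and a tiny generated sample standing in for the real file:

```python
import pandas as pd

tsv_path, csv_path = "hrdata_big.tsv", "hrdata_big.csv"  # hypothetical names

# Tiny sample input standing in for the real 12 GB file.
with open(tsv_path, "w", newline="") as f:
    f.write("Name\tSalary\n"
            "Graham Chapman\t50000.00\n"
            "John Cleese\t65000.00\n"
            "Eric Idle\t45000.00\n")

# chunksize makes read_csv yield DataFrames of at most that many rows,
# so memory use is bounded regardless of file size. dtype=str keeps
# values as text, so numbers like 50000.00 round-trip unchanged.
reader = pd.read_csv(tsv_path, sep="\t", chunksize=2, dtype=str)
for i, chunk in enumerate(reader):
    # Write the header only for the first chunk, then append.
    chunk.to_csv(csv_path, mode="w" if i == 0 else "a",
                 header=(i == 0), index=False)
```

In real use you would pick a much larger `chunksize` (say, a few hundred thousand rows); the tiny value here just exercises the append path.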