


I have a 4 GB file that I need to do some operations on. I have a Bash script to do this, but Bash seems ill suited to reading large data files into an array. So I decided to break up my file with awk.


for((i=0; i<100; i++)); do awk -v i=$i 'BEGIN{binsize=60000}{if(binsize*i < NR && NR <= binsize*(i+1)){print}}END{}' my_large_file.txt &> my_large_file_split$i.fastq; done


However the problem with this script is that it will read in and loop through this large file 100 times (which presumably will lead to about 400GB of IO).


QUESTION: Is there a better strategy that reads the large file only once? Perhaps writing to the output files from within awk instead of redirecting its output?


Assuming binsize is the number of lines you want per chunk, you could maintain a line counter that you reset as you step through the file, and switch output files within awk instead of using the shell to redirect.

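awk -v binsize=60000 '
  BEGIN { filenum = 1; outfile = "output_chunk_" filenum ".txt" }
  count >= binsize {
    # Current chunk is full: close it and switch to the next output file.
    close(outfile)
    filenum++
    outfile = "output_chunk_" filenum ".txt"
    count = 0
  }
  { count++; print > outfile }
' my_large_file.txt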


I haven't actually tested this code, so if it doesn't work verbatim, at least it should give you an idea of a strategy to use. :-)


The idea is to step through the file once, updating the filename held in a variable whenever the current chunk is full. Note that close(outfile) isn't strictly necessary, since awk closes any open files when it exits, but it may save you a few bytes of memory per open file handle (which only becomes significant if you have a great many output files).
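To sanity-check the chunking logic before pointing it at the 4 GB file, the awk approach can be tried on a tiny input. This is only a sketch: the temporary directory, the 10-line input generated with seq, binsize=3 and the chunk_ prefix are all assumptions made for the check, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sanity check on a tiny input (all names and sizes here are illustrative).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 1 10 > small_input.txt   # 10 lines containing 1..10

awk -v binsize=3 '
  BEGIN { filenum = 1; outfile = "chunk_" filenum ".txt" }
  count >= binsize {
    # Chunk is full: close it and start the next one.
    close(outfile)
    filenum++
    outfile = "chunk_" filenum ".txt"
    count = 0
  }
  { count++; print > outfile }
' small_input.txt

# Show the line count of each chunk, then confirm that the chunks,
# concatenated in order, reproduce the input exactly.
wc -l chunk_*.txt
cat chunk_*.txt | cmp - small_input.txt && echo "chunks reassemble to the input"
```

With 10 input lines and binsize=3, this should leave three full chunks and one final chunk holding the remaining line.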


That said, you could do almost exactly the same thing in bash alone:

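#!/usr/bin/env bash

binsize=60000
filenum=1; count=0
outfile="output_chunk_${filenum}.txt"

while IFS= read -r line; do
  if [ "$count" -ge "$binsize" ]; then
    # Current chunk is full: switch to the next output file.
    filenum=$((filenum + 1))
    outfile="output_chunk_${filenum}.txt"
    count=0
  fi
  count=$((count + 1))
  printf '%s\n' "$line" >> "$outfile"
done < my_large_file.txt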


(Also untested.)


And while I'd expect the awk solution to be faster than the bash one, it wouldn't hurt to run your own benchmarks. :)
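One rough way to run such a benchmark is bash's time keyword applied to each approach on the same sample file. The sketch below assumes a 200,000-line sample generated with seq rather than the real data; the function names split_awk and split_bash, the temporary directory and the output prefixes are hypothetical and exist only for this comparison.

```shell
#!/usr/bin/env bash
# Rough benchmark sketch; file names, sizes and function names are illustrative.

split_awk() {   # usage: split_awk INPUT BINSIZE PREFIX
  awk -v binsize="$2" -v prefix="$3" '
    BEGIN { filenum = 1; outfile = prefix filenum ".txt" }
    count >= binsize {
      close(outfile)
      filenum++
      outfile = prefix filenum ".txt"
      count = 0
    }
    { count++; print > outfile }
  ' "$1"
}

split_bash() {  # usage: split_bash INPUT BINSIZE PREFIX
  local filenum=1 count=0 line
  local outfile="$3${filenum}.txt"
  while IFS= read -r line; do
    if [ "$count" -ge "$2" ]; then
      filenum=$((filenum + 1))
      outfile="$3${filenum}.txt"
      count=0
    fi
    count=$((count + 1))
    printf '%s\n' "$line" >> "$outfile"
  done < "$1"
}

# Generate a sample input in a scratch directory.
benchdir=$(mktemp -d)
seq 1 200000 > "$benchdir/sample.txt"

# Time each approach on the same input with the same chunk size.
time split_awk  "$benchdir/sample.txt" 60000 "$benchdir/awk_chunk_"
time split_bash "$benchdir/sample.txt" 60000 "$benchdir/bash_chunk_"
```

Timings on a small synthetic file will not translate directly to a 4 GB input, so it is worth repeating the run on a representative slice of the real data.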