How to improve the speed of reading a large file line by line in Go

Problem description:

I'm trying to figure out the fastest way to read a large file line by line and check whether the line contains a string. The file I'm testing on is about 680 MB.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("./crackstation-human-only.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)

        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Iforgotmypassword") {
                fmt.Println(scanner.Text())
            }
        }
    }

After building the program and timing it on my machine, it takes over 3 seconds:

./speed 3.13s user 1.25s system 122% cpu 3.563 total

After increasing the buffer

buf := make([]byte, 64*1024)
scanner.Buffer(buf, bufio.MaxScanTokenSize)

It gets a little better:

./speed 2.47s user 0.25s system 104% cpu 2.609 total
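
For context, Buffer must be called after creating the scanner and before the first Scan; a minimal sketch of where the tweak slots into the full program (the same code as above, only the Buffer call added):

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("./crackstation-human-only.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        // Set a 64 KiB initial buffer; the second argument caps how large
        // a single line may grow. Must be called before the first Scan.
        scanner.Buffer(make([]byte, 64*1024), bufio.MaxScanTokenSize)

        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Iforgotmypassword") {
                fmt.Println(scanner.Text())
            }
        }
    }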

I know it can get better, because other tools manage to do it in under a second without any kind of indexing. What seems to be the bottleneck with this approach?

0.33s user 0.14s system 94% cpu 0.501 total


LAST EDIT

This is a "line-by-line" solution to the problem that takes trivial time, it prints the entire matching line.

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
)

func main() {
    dat, _ := ioutil.ReadFile("./jumble.txt")
    i := bytes.Index(dat, []byte("Iforgotmypassword"))
    if i != -1 {
        var x int
        var y int
        // Walk backwards from the match to the previous newline.
        for x = i; x > 0; x-- {
            if dat[x] == '\n' {
                x++ // step past the newline to the start of the line
                break
            }
        }
        // Walk forwards from the match to the next newline.
        for y = i; y < len(dat); y++ {
            if dat[y] == '\n' {
                break
            }
        }
        fmt.Println(string(dat[x:y]))
    }
}
real    0m0.421s
user    0m0.068s
sys     0m0.352s
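
The same line-boundary search can be written with the standard library's index helpers; a behavior-equivalent sketch (my rewrite, not part of the original answer):

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
)

func main() {
    dat, _ := ioutil.ReadFile("./jumble.txt")
    needle := []byte("Iforgotmypassword")
    i := bytes.Index(dat, needle)
    if i != -1 {
        // Start of the line: one past the previous newline, or 0 at file start.
        start := bytes.LastIndexByte(dat[:i], '\n') + 1
        // End of the line: the next newline, or end of file.
        end := len(dat)
        if j := bytes.IndexByte(dat[i+len(needle):], '\n'); j != -1 {
            end = i + len(needle) + j
        }
        fmt.Println(string(dat[start:end]))
    }
}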

ORIGINAL ANSWER

If you just need to see if the string is in a file, why not use regex?

Note: I kept the data as a byte slice instead of converting it to a string.

package main

import (
    "fmt"
    "io/ioutil"
    "regexp"
)

var regex = regexp.MustCompile(`Ilostmypassword`)

func main() {
    dat, _ := ioutil.ReadFile("./jumble.txt")
    if regex.Match(dat) {
        fmt.Println("Yes")
    }
}

jumble.txt is an 859 MB file of jumbled text with newlines included.

Running with time ./code I get:

real    0m0.405s
user    0m0.064s
sys     0m0.340s

To try and answer your comment: I don't think the bottleneck inherently comes from searching line by line; Go uses an efficient algorithm for searching strings/runes.

I think the bottleneck comes from the I/O reads. When the program reads from the file, it is normally not first in the queue of read requests, so it must wait until it can read before it can actually start comparing. Thus, when you read over and over, you are repeatedly forced to wait for your turn at the I/O layer.

To give you some math: if your buffer size is 64 * 1024 (65,536 bytes) and your file is 1 GB, you need roughly 1 GB / 65,536 bytes ≈ 15,259 reads to check the entire file. In my method, by contrast, I read the entire file "at once" and check against that single constructed array.
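
To illustrate the "fewer, larger reads" idea while keeping the line-by-line structure, here is a sketch that hands the scanner a 1 MiB buffer and searches the raw bytes; the buffer size is an arbitrary choice of mine, not something from the question:

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "os"
)

func main() {
    f, err := os.Open("./crackstation-human-only.txt")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    // 1 MiB buffer: roughly 16x fewer Read calls than 64 KiB.
    // (Scan will fail on any single line longer than 1 MiB.)
    scanner.Buffer(make([]byte, 1<<20), 1<<20)

    needle := []byte("Iforgotmypassword")
    for scanner.Scan() {
        // Bytes() reuses the scanner's internal buffer, avoiding the
        // per-line string allocation that Text() makes.
        if bytes.Contains(scanner.Bytes(), needle) {
            fmt.Println(scanner.Text())
        }
    }
}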

Another thing I can think of is the sheer number of loop iterations needed to move through the file, and the time needed for each iteration:

Given the following code:

dat, _ := ioutil.ReadFile("./jumble.txt")
sdat := bytes.Split(dat, []byte{'\n'})
for _, l := range sdat {
    if bytes.Equal([]byte("Iforgotmypassword"), l) {
        fmt.Println("Yes")
    }
}

I calculated that each iteration takes 32 nanoseconds on average; the string Iforgotmypassword was on line 100000000 of my file, so the execution time for this loop was roughly 32 ns * 100,000,000 ≈ 3.2 seconds.
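
For reference, a per-iteration figure like that can be measured with a Go micro-benchmark; a minimal sketch (the line contents here are made up):

package main

import (
    "bytes"
    "testing"
)

// Save as loop_test.go and run with `go test -bench=.`; the ns/op
// column reports the average cost of one comparison iteration.
func BenchmarkLineCheck(b *testing.B) {
    needle := []byte("Iforgotmypassword")
    line := []byte("hunter2:5f4dcc3b5aa765d61d8327deb882cf99") // made-up line
    for i := 0; i < b.N; i++ {
        if bytes.Equal(needle, line) {
            b.Fatal("unexpected match")
        }
    }
}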

You might try using goroutines to process multiple lines in parallel:

lines := make(chan string, numWorkers*2) // give the channel enough room so the reader isn't blocked

go func(scanner *bufio.Scanner, out chan<- string) {
  for scanner.Scan() {
    out <- scanner.Text()
  }
  close(out) // closing the channel lets the workers' range loops end
}(scanner, lines)

var wg sync.WaitGroup
wg.Add(numWorkers)

for i := 0; i < numWorkers; i++ {
  go func(in <-chan string) {
    defer wg.Done()
    for text := range in {
      if strings.Contains(text, "Iforgotmypassword") {
        fmt.Println(text)
      }
    }
  }(lines)
}

wg.Wait()

I'm not sure how much this will really speed things up, since it depends on what kind of hardware you have available; it sounds like you're looking for more than a 5x improvement, so you may only notice a difference on something that can run, say, 8 worker threads in parallel. Feel free to use lots of worker goroutines. Good luck.
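
The snippet above assumes scanner and numWorkers are already defined; a plausible setup, my assumption rather than something from the original answer:

// Assumes imports: bufio, os, runtime (plus fmt, strings, sync used above).
f, err := os.Open("./crackstation-human-only.txt")
if err != nil {
  panic(err)
}
defer f.Close()

numWorkers := runtime.NumCPU() // one worker per logical CPU
scanner := bufio.NewScanner(f)
scanner.Buffer(make([]byte, 64*1024), bufio.MaxScanTokenSize)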

Using my own 700 MB test file with your original program, the time was just over 7 seconds.

With grep it was 0.49 seconds

With this program (which doesn't print out the line, it just says "yes"): 0.082 seconds.

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "os"
)

func check(e error) {
    if e != nil {
        panic(e)
    }
}
func main() {
    find := []byte(os.Args[1])
    dat, err := ioutil.ReadFile("crackstation-human-only.txt")
    check(err)
    if bytes.Contains(dat, find) {
        fmt.Print("yes")
    }
}
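
To reproduce the timing (assuming the binary is built as speed, the name used in the question's timings):

go build -o speed .
time ./speed Iforgotmypassword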