How to implement a Scanner with a custom split function

Problem description:

I have a log file, and I need to parse each record in it using Go. Each record begins with "#", and a record can span one or more lines:

# Line1
# Line2
Continued line2
Continued line2
# line3
.....

Some code :), I'm a beginner

    f, _ := os.Open(mylog)
    scanner := bufio.NewScanner(f)
    var queryRec string

    for scanner.Scan() {
        line := scanner.Text()

        if strings.HasPrefix(line, "# ") && len(queryRec) == 0 {
            queryRec = line
        } else if !strings.HasPrefix(line, "# ") && len(queryRec) == 0 {
            fmt.Println("There is a big problem!!!")
        } else if !strings.HasPrefix(line, "# ") && len(queryRec) != 0 {
            queryRec += line
        } else if strings.HasPrefix(line, "# ") && len(queryRec) != 0 {
            queryRec = line
        }
    }

Thanks,

The Scanner type has a method called Split, which takes a SplitFunc that determines how the scanner splits its input into tokens. The default SplitFunc is ScanLines; you can read its implementation in the bufio source. Starting from that, you can write your own SplitFunc that breaks up the input according to your specific format.

func crunchSplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error) {

    // Return nothing if at end of file and no data passed
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }

    // Find the index in the input of a newline followed by a
    // pound sign.
    if i := strings.Index(string(data), "\n#"); i >= 0 {
        return i + 1, data[0:i], nil
    }

    // If at end of file with data return the data
    if atEOF {
        return len(data), data, nil
    }

    return
}

You can see the full implementation of the example at https://play.golang.org/p/ecCYkTzme4. The documentation provides all the insight needed to implement something like this.
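In case the playground link is unavailable, here is a minimal, self-contained sketch of how a split function like the one above can be wired into a Scanner. The names recordSplit and splitRecords are hypothetical helpers chosen for this illustration, and the input is an in-memory string rather than a file. It uses bytes.Index instead of converting to a string, which is an assumption about style rather than part of the original answer.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// recordSplit is a bufio.SplitFunc that ends a token at each newline
// that is immediately followed by '#'.
func recordSplit(data []byte, atEOF bool) (advance int, token []byte, err error) {
	// Nothing left to return at end of input.
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.Index(data, []byte("\n#")); i >= 0 {
		// Skip only the newline, so the next token starts at the '#'.
		return i + 1, data[:i], nil
	}
	// At end of input, return whatever remains as the final record.
	if atEOF {
		return len(data), data, nil
	}
	// Ask the Scanner for more data.
	return 0, nil, nil
}

// splitRecords (hypothetical helper) collects every record found in input.
func splitRecords(input string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(recordSplit)
	var records []string
	for scanner.Scan() {
		records = append(records, scanner.Text())
	}
	return records, scanner.Err()
}

func main() {
	log := "# Line1\n# Line2\nContinued line2\nContinued line2\n# line3\n"
	records, err := splitRecords(log)
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for i, r := range records {
		fmt.Printf("record %d: %q\n", i, r)
	}
}
```

With the sample input from the question this yields three records. Note that the last record keeps its trailing newline, because nothing follows it that matches "\n#"; trim it with strings.TrimRight if that matters.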

Ben Campbell's answer wrapped into a func that returns a splitfunc for a substring:

demo on play.golang.org

Improvement suggestions welcome

// SplitAt returns a bufio.SplitFunc closure, splitting at a substring, e.g.
// scanner.Split(SplitAt("\n# "))
func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {

    return func(data []byte, atEOF bool) (advance int, token []byte, err error) {

        // Return nothing if at end of file and no data passed
        if atEOF && len(data) == 0 {
            return 0, nil, nil
        }

        // Find the index in the input of the separator substring
        if i := strings.Index(string(data), substring); i >= 0 {
            return i + len(substring), data[0:i], nil
        }

        // If at end of file with data return the data
        if atEOF {
            return len(data), data, nil
        }

        return
    }
}

A slightly optimized version of Ben Campbell's and sto-b-doo's solutions

Converting a byte slice to a string appears to be quite a heavy operation.

In my log-processing app it became a bottleneck.

Just keeping the data as bytes gave my app a ~1500% performance boost.

func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {
    searchBytes := []byte(substring)
    searchLen := len(searchBytes)
    return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        dataLen := len(data)

        // Return nothing if at end of file and no data passed
        if atEOF && dataLen == 0 {
            return 0, nil, nil
        }

        // Find next separator and return token
        if i := bytes.Index(data, searchBytes); i >= 0 {
            return i + searchLen, data[0:i], nil
        }

        // If we're at EOF, we have a final, non-terminated line. Return it.
        if atEOF {
            return dataLen, data, nil
        }

        // Request more data.
        return 0, nil, nil
    }
}
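
Two things are worth knowing when using a SplitAt-style function. First, the separator itself is consumed, so with "\n# " every record after the first loses its leading "# ". Second, a Scanner rejects tokens larger than its buffer limit (bufio.MaxScanTokenSize, 64 KiB, by default), in which case Scan returns false and Err reports bufio.ErrTooLong. The sketch below illustrates both points; splitAt is a lowercase copy of the function above, and readRecords and maxRecord are hypothetical names used only for this example.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// splitAt mirrors SplitAt above: it splits the input at every occurrence of sep.
func splitAt(sep string) bufio.SplitFunc {
	sepBytes := []byte(sep)
	return func(data []byte, atEOF bool) (int, []byte, error) {
		if atEOF && len(data) == 0 {
			return 0, nil, nil
		}
		if i := bytes.Index(data, sepBytes); i >= 0 {
			// The separator is skipped, so it is not part of any token.
			return i + len(sepBytes), data[:i], nil
		}
		if atEOF {
			return len(data), data, nil
		}
		return 0, nil, nil
	}
}

// readRecords (hypothetical helper) scans input with splitAt("\n# ") and
// allows records of up to maxRecord bytes.
func readRecords(input string, maxRecord int) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	// Raise the token size limit; Buffer must be called before the first Scan.
	scanner.Buffer(make([]byte, 0, 64*1024), maxRecord)
	scanner.Split(splitAt("\n# "))
	var records []string
	for scanner.Scan() {
		records = append(records, scanner.Text())
	}
	return records, scanner.Err()
}

func main() {
	records, err := readRecords("# Line1\n# Line2\nContinued line2\n# line3\n", 1<<20)
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for i, r := range records {
		fmt.Printf("record %d: %q\n", i, r)
	}
}
```

If the records need to keep their "# " prefix, split on "\n#" and advance by one byte as in the first answer, or re-add the prefix after scanning.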