How to implement a Scanner with a custom split function
I have a log file, and I need to parse each record in it using Go. Each record begins with "#", and a record can span one or more lines:
# Line1
# Line2
Continued line2
Continued line2
# line3
.....
Here is my code so far (I'm a beginner):
f, _ := os.Open(mylog)
scanner := bufio.NewScanner(f)
var queryRec string
for scanner.Scan() {
	line := scanner.Text()
	if strings.HasPrefix(line, "# ") && len(queryRec) == 0 {
		queryRec = line
	} else if !strings.HasPrefix(line, "# ") && len(queryRec) == 0 {
		fmt.Println("There is a big problem!!!")
	} else if !strings.HasPrefix(line, "# ") && len(queryRec) != 0 {
		queryRec += line
	} else if strings.HasPrefix(line, "# ") && len(queryRec) != 0 {
		queryRec = line
	}
}
Thanks,
The Scanner type has a method called Split, which takes a SplitFunc that determines how the scanner splits the input into tokens. The default SplitFunc is ScanLines, whose implementation you can read in the source. From there you can write your own SplitFunc to break up the io.Reader content according to your specific format.
func crunchSplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error) {
	// Return nothing if at end of file and no data passed
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	// Find the index in the input of a newline followed by a
	// pound sign.
	if i := strings.Index(string(data), "\n#"); i >= 0 {
		return i + 1, data[0:i], nil
	}
	// If at end of file with data return the data
	if atEOF {
		return len(data), data, nil
	}
	return
}
You can see the full implementation of the example at https://play.golang.org/p/ecCYkTzme4. The documentation provides all the insight needed to implement something like this.
Suggestions for improvement are welcome.
Ben Campbell's answer, wrapped in a function that returns a SplitFunc for an arbitrary separator substring:
// SplitAt returns a bufio.SplitFunc closure, splitting at a substring
// scanner.Split(SplitAt("\n# "))
func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		// Return nothing if at end of file and no data passed
		if atEOF && len(data) == 0 {
			return 0, nil, nil
		}
		// Find the index in the input of the separator substring
		if i := strings.Index(string(data), substring); i >= 0 {
			return i + len(substring), data[0:i], nil
		}
		// If at end of file with data return the data
		if atEOF {
			return len(data), data, nil
		}
		return
	}
}
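One thing worth seeing in practice: unlike the first answer, this version advances past the whole separator, so the separator is dropped from the tokens. The sketch below illustrates that with the separator "\n# " on an in-memory string; `splitString` is a hypothetical helper added only for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// SplitAt returns a split function that cuts the input at every
// occurrence of substring and discards the separator itself.
func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if atEOF && len(data) == 0 {
			return 0, nil, nil
		}
		if i := strings.Index(string(data), substring); i >= 0 {
			return i + len(substring), data[0:i], nil
		}
		if atEOF {
			return len(data), data, nil
		}
		return
	}
}

// splitString (hypothetical helper) scans input and returns all tokens.
func splitString(input, sep string) []string {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(SplitAt(sep))
	var tokens []string
	for scanner.Scan() {
		tokens = append(tokens, scanner.Text())
	}
	return tokens
}

func main() {
	input := "# Line1\n# Line2\nContinued line2\n# line3"
	for _, tok := range splitString(input, "\n# ") {
		fmt.Printf("%q\n", tok)
	}
}
```

With this input, the first token still begins with "# " (nothing precedes it to match the separator), while the later tokens have it stripped. If the records need a uniform shape, trimming the leading "# " from the first token is one option.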
A slightly optimized version of Ben Campbell's and sto-b-doo's solution.
Converting the byte slice to a string appears to be quite a heavy operation.
In my log-processing app it became a bottleneck.
Keeping the data as bytes gave my app a ~1500% performance boost.
func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {
	searchBytes := []byte(substring)
	searchLen := len(searchBytes)
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		dataLen := len(data)
		// Return nothing if at end of file and no data passed
		if atEOF && dataLen == 0 {
			return 0, nil, nil
		}
		// Find next separator and return token
		if i := bytes.Index(data, searchBytes); i >= 0 {
			return i + searchLen, data[0:i], nil
		}
		// If we're at EOF, we have a final, non-terminated line. Return it.
		if atEOF {
			return dataLen, data, nil
		}
		// Request more data.
		return 0, nil, nil
	}
}
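To check the conversion cost on your own data, a rough sketch like the following compares the two search strategies using `testing.Benchmark`. The input (one 32 KiB record followed by a separator) and the helper names `indexViaString` and `indexViaBytes` are assumptions made for illustration; results will vary with record size, machine, and Go version, so treat the numbers as indicative only.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"testing"
)

// sink keeps the results alive so the calls are not optimized away.
var sink int

// indexViaString converts the buffer to a string before searching,
// as the earlier versions of the split function do.
func indexViaString(data []byte, sep string) int {
	return strings.Index(string(data), sep)
}

// indexViaBytes searches the byte slice directly, without a copy.
func indexViaBytes(data, sep []byte) int {
	return bytes.Index(data, sep)
}

func main() {
	// Assumed input: one long record followed by the separator.
	data := []byte(strings.Repeat("x", 32*1024) + "\n# next record")
	sep := "\n# "
	sepBytes := []byte(sep)

	viaString := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += indexViaString(data, sep)
		}
	})
	viaBytes := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += indexViaBytes(data, sepBytes)
		}
	})

	fmt.Println("string conversion:", viaString)
	fmt.Println("bytes.Index:      ", viaBytes)
}
```

Both helpers return the same index; the difference is only in whether the buffer is copied into a new string on every call of the split function.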