Reading lines from a file with variable line endings in Go

Problem description:

How can I read lines from a file where the line endings are carriage return (CR), newline (NL), or both?

The PDF specification allows lines to end with CR, LF, or CRLF.

  • bufio.Reader.ReadString() and bufio.Reader.ReadBytes() allow a single delimiter byte.

  • bufio.Scanner.Scan() handles \n optionally preceded by \r, but not a lone \r.

    The end-of-line marker is one optional carriage return followed by one mandatory newline.
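The limitation can be seen in a small sketch. It assumes the default split function (bufio.ScanLines) and a made-up input string; defaultLines is a hypothetical helper name used only for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// defaultLines splits s with bufio.Scanner's default split function,
// bufio.ScanLines, and returns the resulting lines.
func defaultLines(s string) []string {
	var lines []string
	scan := bufio.NewScanner(strings.NewReader(s))
	for scan.Scan() {
		lines = append(lines, scan.Text())
	}
	return lines
}

func main() {
	// The input mixes a lone CR, a lone LF, and a CRLF pair.
	for _, line := range defaultLines("a\rb\nc\r\nd") {
		fmt.Printf("%q\n", line)
	}
	// The lone \r is not treated as a line ending, so the first
	// line comes back as "a\rb" rather than as "a" and "b".
}
```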

Do I need to write my own function that uses bufio.Reader.ReadByte()?

You can write a custom bufio.SplitFunc for bufio.Scanner, for example:

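// Mostly bufio.ScanLines code, extended to also accept a lone '\r'.
// Requires the "bytes" package.
func ScanPDFLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
        if data[i] == '\n' {
            // We have a line terminated by a single newline.
            return i + 1, data[0:i], nil
        }
        // The line ends with '\r'. If that is the last buffered byte and more
        // input may follow, wait so that a "\r\n" pair is not split in two.
        if i == len(data)-1 && !atEOF {
            return 0, nil, nil
        }
        advance = i + 1
        if len(data) > i+1 && data[i+1] == '\n' {
            // CRLF: consume the newline as well.
            advance++
        }
        return advance, data[0:i], nil
    }
    // If we're at EOF, we have a final, non-terminated line. Return it.
    if atEOF {
        return len(data), data, nil
    }
    // Request more data.
    return 0, nil, nil
}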

Then use it like this:

scan := bufio.NewScanner(r)
scan.Split(ScanPDFLines)
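
For completeness, here is a self-contained sketch that puts the pieces together. It is one possible arrangement, not a definitive one: pdfLines is a hypothetical helper, the input string is made up, and the split function includes a guard that waits for more input when the buffer ends in \r, so a CRLF pair straddling two reads is not reported as two line endings. iotest.OneByteReader is used only to simulate the worst-case chunking.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
	"testing/iotest"
)

// ScanPDFLines is a bufio.SplitFunc that accepts CR, LF, or CRLF
// as a line terminator.
func ScanPDFLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			return i + 1, data[0:i], nil
		}
		// A '\r' at the very end of the buffer might be the first half
		// of "\r\n"; ask for more data unless the input is exhausted.
		if i == len(data)-1 && !atEOF {
			return 0, nil, nil
		}
		advance = i + 1
		if len(data) > i+1 && data[i+1] == '\n' {
			advance++
		}
		return advance, data[0:i], nil
	}
	if atEOF {
		// Final line without a terminator.
		return len(data), data, nil
	}
	return 0, nil, nil
}

// pdfLines (hypothetical helper) scans s one byte per read and
// returns the lines found by ScanPDFLines.
func pdfLines(s string) []string {
	var lines []string
	scan := bufio.NewScanner(iotest.OneByteReader(strings.NewReader(s)))
	scan.Split(ScanPDFLines)
	for scan.Scan() {
		lines = append(lines, scan.Text())
	}
	return lines
}

func main() {
	for _, line := range pdfLines("a\rb\nc\r\nd") {
		fmt.Printf("%q\n", line)
	}
}
```

In real code, check scan.Err() after the loop; it is omitted here because reading from an in-memory string cannot fail.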