How to get the contents of an HTML element

Problem description:

I'm quite new to Go and I'm struggling a little at the moment with parsing some html.

The HTML looks like:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>

    <div>something</div>

    <div id="publication">
        <div>I want <span>this</span></div>
    </div>

    <div>
        <div>not this</div>
    </div>

</body>
</html>

And I want to get this as a string:

<div>I want <span>this</span></div>

I've tried html.NewTokenizer() (from golang.org/x/net/html) but can't seem to get the entire contents of an element back from a token or node. I've also tried using depth with this but it picked up other bits of code.

I've also had a go with goquery which seems perfect, code:

doc, err := goquery.NewDocument("{url}")
if err != nil {
    log.Fatal(err)
}

doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    fmt.Printf("Review %d: %s\n", i, s.Html())
})

But s.Text() will only print out the text and s.Html() doesn't seem to exist (?).

I think parsing it as XML would work, except the actual HTML is very deep and there would have to be a struct for each parent element...

Any help would be amazing!

Answer

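s.Html() does exist. Your code fails because Html() returns two values, the HTML string and an error, so the call cannot be passed directly as one of several arguments to fmt.Printf. Assign both return values to variables first.

Replace your loop with this and it will work: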

doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    // Html() returns the inner HTML of the first matched element, plus an error.
    innerHTML, err := s.Html()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Review %d: %s\n", i, innerHTML)
})
