How to get the contents of an HTML element
I'm quite new to Go and I'm struggling a little at the moment with parsing some html.
The HTML looks like:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div>something</div>
<div id="publication">
<div>I want <span>this</span></div>
</div>
<div>
<div>not this</div>
</div>
</body>
</html>
And I want to get this as a string:
<div>I want <span>this</span></div>
I've tried html.NewTokenizer() (from golang.org/x/net/html) but can't seem to get the entire contents of an element back from a token or node. I've also tried using depth with this but it picked up other bits of code.
I've also had a go with goquery which seems perfect, code:
doc, err := goquery.NewDocument("{url}")
if err != nil {
	log.Fatal(err)
}
doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
	fmt.Printf("Review %d: %s\n", i, s.Html())
})
But s.Text() will only print out the text and s.Html() doesn't seem to exist (?).
I think parsing it as XML would work, except the actual HTML is very deep and there would have to be a struct for each parent element...
Any help would be amazing!
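s.Html() does exist. Your code fails to compile because Html() returns two values, a string and an error, and Go does not let you pass a multi-value call directly as one argument among several to fmt.Printf.
Assign both results to variables first and it will work: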
doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
	// Html() returns the inner HTML of the first matched element, plus an error.
	innerHTML, err := s.Html()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Review %d: %s\n", i, innerHTML)
})