Parsing HTML into formatted plain text with jsoup

Problem description:

I am working on a Maven project that parses HTML data from a website. I was able to parse it using the code below:

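import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public void parseData() {
    String url = "http://stackoverflow.com/help/on-topic";
    try {
        // Fetch and parse the page
        Document doc = Jsoup.connect(url).get();
        // first() returns null when nothing matches the selector
        Element essay = doc.select("div.col-section").first();
        if (essay == null) {
            return;
        }
        // text() flattens the whole element into a single string
        String essayText = essay.text();
        jTextAreaAdem.setText(essayText);
    } catch (IOException ex) {
        Logger.getLogger(formAdem.class.getName()).log(Level.SEVERE, null, ex);
    }
}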

So far I have no problems: I can parse the HTML data. I am using jsoup's select method with "div.col-section", which means I'm looking for the div element whose class is col-section. I want to print the data in a text area. The result I get is one huge paragraph, even though the real data on the website has more than one paragraph. So how can I parse the data so that it looks like it does on the website?

The text is not formatted because the formatting lives in the HTML, in tags such as <p> and <ol>. Calling .text() on a block element loses that formatting: it returns the combined text of the element and its children with whitespace normalized.

jsoup has an example HTML to plain text converter, which you can adapt to your needs by providing the div element as the focus.
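
To give a rough idea of what adapting such a converter could look like, here is a cut-down sketch of the same idea. It is not jsoup's own example code: the class name PlainTextSketch, the method toPlainText and the short list of tags that trigger a line break are all assumptions made for illustration. It walks the nodes under the focus element and inserts a newline wherever a block-level tag ends.

```java
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper, not part of jsoup.
class PlainTextSketch {

    // Walks the subtree under 'focus' and emits its text,
    // breaking lines at a small, assumed set of block tags.
    static String toPlainText(Element focus) {
        final StringBuilder out = new StringBuilder();
        focus.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() returns the node's text with whitespace normalized
                    out.append(((TextNode) node).text());
                } else if (node.nodeName().equals("li")) {
                    // start each list item on its own line with a bullet
                    out.append("\n * ");
                }
            }

            @Override
            public void tail(Node node, int depth) {
                String name = node.nodeName();
                // end the line after paragraphs, line breaks and headings
                if (name.equals("p") || name.equals("br") || name.matches("h[1-6]")) {
                    out.append("\n");
                }
            }
        });
        return out.toString().trim();
    }
}
```

With the code from the question, this would be called as PlainTextSketch.toPlainText(essay) in place of essay.text(). A real page will probably need more tags handled (for example div, tr or dt), so treat the tag list as a starting point.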

Alternatively, you could just select "div.col-section > *", iterate through each Element, and print out its text followed by a newline.
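
A minimal sketch of that second approach, assuming the direct children of the div are the blocks you want on separate lines. The class name ChildLines and the method childLines are hypothetical names used only for this example, and the HTML in main is a small stand-in for the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of jsoup.
class ChildLines {

    // Returns the text of every element matched by the selector, one per line.
    static String childLines(Document doc, String childSelector) {
        StringBuilder sb = new StringBuilder();
        for (Element child : doc.select(childSelector)) {
            String text = child.text();
            if (!text.isEmpty()) {   // skip children that contain no text
                sb.append(text).append("\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in markup; the real code would use Jsoup.connect(url).get()
        String html = "<div class=\"col-section\">"
                + "<p>First paragraph.</p><p>Second paragraph.</p></div>";
        Document doc = Jsoup.parse(html);
        System.out.print(childLines(doc, "div.col-section > *"));
    }
}
```

In the code from the question, the result could then be passed to the text area with jTextAreaAdem.setText(childLines(doc, "div.col-section > *")). Note that nested structure inside each child (for example the items of an <ol>) is still flattened onto one line by text().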