XHTML parsing with HTMLAgilityPack

Problem description:

I have a list of elements like the following inside an element that I found using HTMLAgilityPack.

<option value="67"><span style="color: #cc0000;">Horde</span> Leveling / Dailies & Event Guide ($50.00)</option>

What I need to do is parse all the text out of the tag, without all the mumbo jumbo in there. I've tried (seemingly!) everything, but it always comes out looking like this:

Horde
Leveling / Dailies & Event Guide ($50.00)

Or sometimes even:

Horde
Leveling
/ Dailies & Event Guide ($50.00)

And a couple of other variations like that. I've even gone so far as to print out each character in the string as a byte, and I haven't found any unexpected line breaks or line feeds, only what I expected: normal letters and spaces. Here's the full HTML source for reference, copied straight from the page.

<option value="13"><span style="color: #0000ff;">Alliance</span> Leveling Guide ($30.00)</option>


<option value="12"><span style="color: #cc0000;">Horde</span> Leveling Guide ($30.00)</option>

<option value="46"><span style="color: #cc0000;">Horde</span> Dailies & Events Guide ($25.00)</option>

<option value="67"><span style="color: #cc0000;">Horde</span> Leveling / Dailies & Event Guide ($50.00)</option>


<option value="11"><span style="color: #0000ff;">Alliance</span> &amp; <span style="color: #cc0000;">Horde</span> Leveling Guide ($50.00)</option>

<option value="97"><span style="color: #0000ff;">Alliance</span> Achievements & Professions Guide ($20.00)</option>

<option value="98"><span style="color: #cc0000;">Horde</span> Achievements & Professions Guide ($20.00)</option>


<option value="99"><span style="color: #0000ff;">Alliance</span> &amp; <span style="color: #cc0000;">Horde</span> Achievements & Professions Guide ($30.00)</option>

By default, Html Agility Pack treats the <OPTION> tag as "Empty", which means it does not expect a closing </OPTION>. That is why, in this case, the text is not easy to catch with XPath. You can change this using the HtmlNode.ElementsFlags collection.

Here is code that should do what you want:

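// Stop treating <option> as an empty element (ElementsFlags is static, so this applies globally).
HtmlNode.ElementsFlags.Remove("option");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtml);
// SelectNodes returns null when nothing matches, so check before iterating.
HtmlNodeCollection options = doc.DocumentNode.SelectNodes("//option");
if (options != null)
{
    foreach (HtmlNode node in options)
    {
        Console.WriteLine(node.InnerText);
    }
}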
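
If the text still contains stray line breaks or undecoded entities such as &amp;, it can be tidied up afterwards. The following is only a minimal sketch: OptionText and Clean are hypothetical names, not part of Html Agility Pack, and it assumes that collapsing every run of whitespace into a single space is acceptable for these option labels. It uses only the .NET standard library.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Html Agility Pack.
public static class OptionText
{
    public static string Clean(string innerText)
    {
        // Decode HTML entities such as &amp; into their characters.
        string decoded = WebUtility.HtmlDecode(innerText ?? string.Empty);

        // Collapse any run of whitespace (spaces, tabs, line breaks)
        // into a single space, then trim both ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

In the loop above it would be called as Console.WriteLine(OptionText.Clean(node.InnerText)); so that a label split over several lines comes out as one line.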