How do I convert HTML to text in C#?
I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping, but for something that outputs plain text while reasonably preserving the original layout.
The output should look like this:
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, and so on. The output looked nothing like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was hoping to find a more "canned" solution.
EDIT 2: Thank you, everybody, for your suggestions. FlySwat pointed me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture stdout by setting ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process versus doing it in code. Plus, Lynx is FAST!!
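For anyone who wants to try the same approach, here is a minimal sketch of what such a wrapper class might look like. The class name LynxHtmlToText, the method names, and the lynxPath parameter are all hypothetical (they are not from the original post), and it assumes lynx.exe is installed and reachable at the path you pass in. The only Lynx option used is the "-dump" switch mentioned above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper: the names here are illustrative, not from the original post.
public static class LynxHtmlToText
{
    // Builds the start info for "lynx -dump <source>".
    // lynxPath is an assumption: wherever lynx.exe lives on your machine.
    // source may be a local file path or a URL.
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string source)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            // Naive quoting: fine for ordinary paths and URLs, but it does not
            // escape double quotes that appear inside source.
            Arguments = "-dump \"" + source + "\"",
            UseShellExecute = false,        // required in order to redirect stdout
            RedirectStandardOutput = true,  // lets us read the rendered text
            CreateNoWindow = true           // no console window flashing up
        };
    }

    // Runs Lynx and returns whatever it wrote to standard output.
    public static string Convert(string lynxPath, string source)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, source)))
        {
            if (process == null)
            {
                throw new InvalidOperationException("Could not start Lynx.");
            }

            // Read stdout before waiting for exit, so a full output pipe
            // cannot block the child process.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "Lynx exited with code " + process.ExitCode + ".");
            }
            return text;
        }
    }
}
```

Usage would be something like string text = LynxHtmlToText.Convert(@"C:\tools\lynx\lynx.exe", @"C:\temp\page.html"); where both paths are made up for the example. Splitting out BuildStartInfo keeps the process configuration checkable without actually launching Lynx.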
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.