How do I read and parse an HTML file?

Problem description:

I have an HTML file and need to read it and access some of the values in it:

myHtml = 'toto.html';
readFile = fileread(myHtml);

Now, to parse the HTML file: do you know whether it's possible to convert the HTML to XML and then use XPath?

I would not recommend attempting to convert HTML to XML. They are different formats, and you are likely to get burned. HTML parsers exist, so we can use those directly.

Also, just for completeness: don't try to parse HTML with regex. There are Stack Overflow questions about parsing HTML in MATLAB in which the answers recommend regex. Do innocent kittens a favor and tune them out.

Unfortunately, it doesn't look like MATLAB has an HTML parser as part of its library.

Fortunately, you can call Java code from MATLAB with ease!
With that, Java HTML parsers are fair game. Look into jsoup or JTidy. Poke around this question.
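
As a rough illustration of the Java route, here is a minimal sketch using jsoup from MATLAB. It assumes you have downloaded the jsoup JAR yourself; the JAR file name, the sample HTML string, and the 'a[href]' selector are placeholders for illustration and do not come from the question.

```matlab
% Assumption: a jsoup JAR sits in the current folder.
% 'jsoup.jar' is a hypothetical file name; use the name of the JAR you downloaded.
javaaddpath('jsoup.jar');

% Stand-in for the file contents; in practice use: html = fileread('toto.html');
html = ['<html><head><title>Demo</title></head>' ...
        '<body><a href="a.html">First</a></body></html>'];

% Jsoup.parse(String) returns an org.jsoup.nodes.Document
doc = org.jsoup.Jsoup.parse(html);

% Java strings come back as java.lang.String, so convert them with char()
titleText = char(doc.title());

% select() takes a CSS selector and returns a Java list of elements
links = doc.select('a[href]');

% Java collections are indexed from 0
for k = 0:links.size()-1
    link = links.get(k);
    fprintf('%s -> %s\n', char(link.text()), char(link.attr('href')));
end
```

Note that jsoup queries the document with CSS selectors, so for many lookups there is no need for the HTML-to-XML detour at all.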

Actually, looking at that question, plus the Comparison of HTML parsers Wikipedia article (thanks @Daniel R!), it looks like HtmlCleaner or JTidy might be able to clean HTML into XML. Again, I wouldn't bother; I would simply parse the HTML directly.