C#: download the HTML string after the page has finished loading
I am trying to use a loop to download a bunch of HTML pages and scrape data from them. But those pages run some JavaScript while loading, so I think WebClient may not be a good choice. However, if I use a WebBrowser as shown below, it returns an empty HTML string after the first call in the loop.
WebBrowser wb = new WebBrowser();
wb.ScrollBarsEnabled = false;
wb.ScriptErrorsSuppressed = true;
wb.Navigate(url);
while (wb.ReadyState != WebBrowserReadyState.Complete) { Application.DoEvents(); Thread.Sleep(1000); }
html = wb.Document.DomDocument.ToString();
You are correct that WebClient and all of the other HTTP client interfaces will completely ignore JavaScript; none of them are browsers, after all.
You want:
var html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;
Note that if you load via a WebBrowser you don't need to scrape the raw markup; you can use DOM methods like GetElementById, GetElementsByTagName, and so on.
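As a rough sketch of the DOM approach, the helper below collects link targets from an already loaded document without touching the raw markup. The class and method names (DomExamples, CollectLinkTargets) are made up for illustration; GetElementsByTagName and GetAttribute are the regular HtmlDocument and HtmlElement members.

```csharp
using System.Collections.Generic;
using System.Windows.Forms;

public static class DomExamples
{
    // Hypothetical helper: returns the href of every <a> element in a loaded document.
    public static List<string> CollectLinkTargets(HtmlDocument document)
    {
        var targets = new List<string>();
        if (document == null)
        {
            return targets; // nothing loaded yet
        }

        foreach (HtmlElement anchor in document.GetElementsByTagName("a"))
        {
            string href = anchor.GetAttribute("href");
            if (!string.IsNullOrEmpty(href))
            {
                targets.Add(href);
            }
        }
        return targets;
    }
}
```

Inside a DocumentCompleted handler this could be called as DomExamples.CollectLinkTargets(wb.Document).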
The while loop is very VBScript; there is a DocumentCompleted event that you should wire your code into instead.
private void Whatever()
{
    WebBrowser wb = new WebBrowser();
    // Subscribe before navigating so the completion event is not missed.
    wb.DocumentCompleted += Wb_DocumentCompleted;
    wb.ScriptErrorsSuppressed = true;
    wb.Navigate("http://stackoverflow.com");
}

private void Wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var wb = (WebBrowser)sender;
    // Full markup of the loaded page.
    var html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;
    // Assumes the page has an element with id "copyright"; GetElementById returns null otherwise.
    var domd = wb.Document.GetElementById("copyright").InnerText;
    /* ... */
}
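Since the question is about a loop over many pages, here is one possible way to chain the loads through DocumentCompleted rather than polling. It is only a sketch under a few assumptions: SequentialPageLoader, Results and Finished are hypothetical names, the class must be created and used on the UI (STA) thread of a WinForms application with a running message loop, and the ReadyState check is a common heuristic for skipping per-frame events rather than a guarantee. Pages that keep running scripts after the load completes may still need extra waiting.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper: loads URLs one at a time and stores each page's HTML.
public sealed class SequentialPageLoader
{
    private readonly WebBrowser _browser = new WebBrowser { ScriptErrorsSuppressed = true };
    private readonly Queue<string> _pending;
    private string _current;

    // Maps each requested URL to the markup captured when it finished loading.
    public Dictionary<string, string> Results { get; } = new Dictionary<string, string>();

    // Raised after the last URL has been processed.
    public event EventHandler Finished;

    public SequentialPageLoader(IEnumerable<string> urls)
    {
        _pending = new Queue<string>(urls);
        _browser.DocumentCompleted += OnDocumentCompleted;
    }

    public void Start()
    {
        LoadNext();
    }

    private void LoadNext()
    {
        if (_pending.Count == 0)
        {
            Finished?.Invoke(this, EventArgs.Empty);
            return;
        }
        _current = _pending.Dequeue();
        _browser.Navigate(_current);
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can fire once per frame; wait until the whole page reports Complete.
        if (_browser.ReadyState != WebBrowserReadyState.Complete)
        {
            return;
        }

        HtmlDocument document = _browser.Document;
        HtmlElementCollection roots = document == null ? null : document.GetElementsByTagName("HTML");
        Results[_current] = (roots != null && roots.Count > 0) ? roots[0].OuterHtml : string.Empty;

        // Only start the next navigation once the current page is done.
        LoadNext();
    }
}
```

From a form, this might be used by creating the loader with the list of URLs, subscribing to Finished to read Results, and then calling Start(). Navigating only from the completion handler means the control is never asked for a document it has not finished loading.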