C#: download the HTML string after the page has finished loading

Problem description:

I am trying to use a loop to download a bunch of HTML pages and scrape data from them. But those pages run some JavaScript while loading, so I think WebClient may not be a good choice. However, if I use a WebBrowser as shown below, it returns an empty HTML string after the first call in the loop.

WebBrowser wb = new WebBrowser();
wb.ScrollBarsEnabled = false;
wb.ScriptErrorsSuppressed = true;
wb.Navigate(url);
while (wb.ReadyState != WebBrowserReadyState.Complete) { Application.DoEvents(); Thread.Sleep(1000); }
html = wb.Document.DomDocument.ToString();

You are correct that WebClient and all of the other HTTP client interfaces will completely ignore JavaScript; none of them are browsers, after all.

You want:

var html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;

Note that if you load the page via a WebBrowser you don't need to scrape the raw markup; you can use DOM methods such as GetElementById and GetElementsByTagName.

The while loop is very VBScript; there is a DocumentCompleted event that you should wire your code into instead.

private void Whatever()
{
    WebBrowser wb = new WebBrowser();
    wb.DocumentCompleted += Wb_DocumentCompleted;

    wb.ScriptErrorsSuppressed = true;
    wb.Navigate("http://stackoverflow.com");
}

private void Wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var wb = (WebBrowser)sender;

    var html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;
    var domd = wb.Document.GetElementById("copyright").InnerText;
    /* ... */
}
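
Since the question is about a loop over many pages, here is a minimal sketch of how the DocumentCompleted approach might be extended to visit several URLs one after another. The class name SequentialPageLoader, its Results dictionary and its Finished event are hypothetical names made up for illustration; they are not part of the original answer or of WinForms. It assumes the object is created on a UI (STA) thread with a running message loop, for example from a form's Load handler.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper (not from the original answer): loads URLs one at a
// time and stores the HTML of each page once its document has completed.
public sealed class SequentialPageLoader
{
    private readonly WebBrowser _browser = new WebBrowser { ScriptErrorsSuppressed = true };
    private readonly Queue<string> _pending;
    private string _current;

    // HTML of each page, keyed by the URL that was requested.
    public Dictionary<string, string> Results { get; } = new Dictionary<string, string>();

    // Raised once every queued URL has been processed.
    public event EventHandler Finished;

    public SequentialPageLoader(IEnumerable<string> urls)
    {
        _pending = new Queue<string>(urls);
        _browser.DocumentCompleted += OnDocumentCompleted;
    }

    public void Start()
    {
        NavigateNext();
    }

    private void NavigateNext()
    {
        if (_pending.Count == 0)
        {
            Finished?.Invoke(this, EventArgs.Empty);
            return;
        }

        _current = _pending.Dequeue();
        _browser.Navigate(_current);
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event also fires for each frame, so wait for the top-level document.
        if (e.Url != _browser.Url) return;

        HtmlElementCollection roots = _browser.Document.GetElementsByTagName("HTML");
        Results[_current] = roots.Count > 0 ? roots[0].OuterHtml : string.Empty;

        // Only now move on to the next page; no polling loop is needed.
        NavigateNext();
    }
}
```

Usage would look something like: var loader = new SequentialPageLoader(urls); loader.Finished += (s, e) => { /* read loader.Results */ }; loader.Start(); Keep in mind that DocumentCompleted only tells you the document has loaded; scripts that keep working asynchronously afterwards may not have finished, so some pages may need an extra wait or a check for a specific element.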