Downloading the HTML of a URL with Python - but with JavaScript enabled

Problem description:

I am trying to download this page so that I can scrape the search results. However, when I download the page and try to process it with BeautifulSoup, I find that parts of the page (for example, the search results) aren't included, because the site has detected that JavaScript is not enabled.

Is there a way to download the HTML of a URL with JavaScript enabled in Python?
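For context, here is a minimal sketch of the naive approach that runs into this problem. The URL and the CSS class are hypothetical placeholders; the point is that urllib only fetches the initial HTML and never runs the page's scripts, so JavaScript-built content is simply absent from what BeautifulSoup sees.

# a sketch of the naive approach: fetch the raw HTML and parse it.
# Anything the site injects with JavaScript will be missing, because
# urllib downloads the initial HTML but does not execute scripts.
import urllib.request
from bs4 import BeautifulSoup

url = "http://example.com/search?q=foo"  # hypothetical search URL

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Hypothetical selector; on a JavaScript-driven site the results
# container is typically empty or absent in the raw HTML.
results = soup.find_all("div", class_="search-result")
print(len(results))  # often 0, even though results show up in a browser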

@kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one that is already written. There's the PhantomJS (C++) project, or PyPhantomJS (Python). Basically, the Python one is QtWebKit plus Python.

They're both headless browsers which you can control directly from JavaScript. The Python version also has a plug-in system which allows you to extend the core, should you need additional functionality.

Here's an example script for PyPhantomJS (with the saveToFile plugin):

// create new webpage
var page = new WebPage();

// open page, set callback
page.open('url', function(status) {
    // exit if page couldn't load
    if (status !== 'success') {
        console.log('FAIL to load!');
        phantom.exit(1);
    }

    // save page content to file
    phantom.saveToFile(page.content, 'myfile.txt');
    phantom.exit();
});
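To tie this back to Python, you could launch the headless browser as a subprocess and then feed the saved file to BeautifulSoup. This is only a sketch under assumptions: the exact PyPhantomJS invocation below (a pyphantomjs.py entry point taking the script path as its first argument) and the save.js filename are assumptions, and the CSS class is hypothetical; check your installation for the real command line.

# a sketch: run the PyPhantomJS script above, then parse the rendered
# HTML with BeautifulSoup back in Python.
import subprocess
from bs4 import BeautifulSoup

# Assumption: PyPhantomJS is started as "python pyphantomjs.py <script>";
# verify the actual entry point and any plugin flags for your setup.
subprocess.check_call(["python", "pyphantomjs.py", "save.js"])

# The script above wrote the fully rendered page to myfile.txt, so normal
# BeautifulSoup scraping now sees the JavaScript-built content.
with open("myfile.txt") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Hypothetical selector; adjust to the markup of the page you are scraping.
for result in soup.select("div.search-result"):
    print(result.get_text(strip=True))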

Useful links:
API Reference | How to write a plugin