How to scrape dynamic websites in Python (without Selenium)

Question:

Are there any libraries or alternatives to Selenium for scraping data from dynamic (JavaScript-rendered) websites?

The issue I've run into is that many websites can very easily detect when I'm using a webdriver with Selenium. I've tried things such as changing the cdc_ variable in my webdriver, and I am still detected. I've been researching ways to make Selenium undetectable, but it seems impossible.

So I'm looking for a way to scrape dynamic websites without using Selenium. Any suggestions would help.

Thanks!

Answer

If you don't want to use Selenium to scrape a dynamic website, I know of two ways:

Find the AJAX API that the page calls and send a GET request to it directly. This can be done with just the requests or urllib module. (I recommend this approach, but it takes some extra effort.)
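As a minimal sketch of the AJAX approach, the snippet below builds a GET request for a JSON endpoint with urllib and extracts fields from the response. The endpoint URL, the `page` query parameter, the request headers, and the `items`/`title` JSON layout are all hypothetical placeholders; the real values would have to be read from the browser's network tab for the site in question.

```python
import json
import urllib.request
from typing import List
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the URL seen in the browser's network tab.
API_URL = "https://example.com/api/items"


def build_request(page: int) -> urllib.request.Request:
    """Build a GET request that resembles the page's own AJAX call."""
    query = urlencode({"page": page})  # "page" is an assumed parameter name
    headers = {
        # Assumed headers; copy the ones the real site expects.
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(f"{API_URL}?{query}", headers=headers)


def parse_items(body: bytes) -> List[str]:
    """Extract titles from a JSON body shaped like {"items": [{"title": ...}]}."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("items", [])]


def fetch_titles(page: int = 1, timeout: float = 10.0) -> List[str]:
    """Send the request and return the parsed titles."""
    with urllib.request.urlopen(build_request(page), timeout=timeout) as resp:
        return parse_items(resp.read())
```

Calling `fetch_titles(1)` would perform the network request. Because no browser is involved, there is no webdriver for the site to detect, although a site may still check headers, cookies, or request rate.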

If your Python version is >= 3.6, you could try the requests-html module. As far as I know, it can retrieve some text that is rendered by JavaScript.
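A rough sketch of the requests-html route might look like the following. It assumes the package is installed (`pip install requests-html`); the URL and CSS selector are left to the caller, and the helper name `rendered_text` is made up for illustration. Note that `render()` downloads and drives a headless Chromium on first use, so this is lighter to set up than Selenium but may not avoid every form of detection.

```python
from typing import Optional


def rendered_text(url: str, selector: str, timeout: float = 20.0) -> Optional[str]:
    """Return the text of the first element matching `selector` after JS runs.

    `rendered_text` is a hypothetical helper, not part of requests-html.
    """
    # Imported lazily so this module still loads if requests-html is missing.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # Execute the page's JavaScript in headless Chromium and
        # replace response.html with the rendered markup.
        response.html.render(timeout=timeout)
        element = response.html.find(selector, first=True)
        return element.text if element is not None else None
    finally:
        session.close()
```

For example, `rendered_text("https://example.com", "h1")` would return the heading text once the page's scripts have run, or `None` if nothing matches the selector.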
