Reading the whole content of a web page in Java
I want to crawl the whole content of the following link with a Java program. The first page is no problem, but when I try to crawl the data of the next pages, I get the same source code as for page one. A simple HTTP GET therefore does not help at all.
This is the link for the page I need to crawl.
The web site has active content that needs to be interpreted and executed by an HTML/CSS/JavaScript rendering engine. I do have a working solution with PhantomJS, but running PhantomJS code from Java is cumbersome.
Is there an easier way to read the whole content of the page with Java code? I have already searched for a solution, but could not find anything suitable.
Appreciate your help,
Kind regards.
Using the Chrome network log (or a similar tool in any other browser) you can identify the XHR request that loads the actual data displayed on the page. I have removed some of the query parameters, but essentially the request looks like this:
GET https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=1&user_bridge=0&_=1461181945520
Helpfully, the query parameters look quite easy to understand. The order=asc&limit=10&page=1 part looks like it would be easy to adjust to return your desired results. You could adjust the page parameter to crawl successive pages of data.
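As a sketch, the paginated request could be issued from Java with the built-in java.net.http.HttpClient (Java 11+). Note that the query string below only contains the parameters shown above; since I removed some parameters, you would need to copy the full query string from your own network log:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BlaBlaCarCrawler {

    // Build the search URL for a given result page. Only the parameters
    // visible in the request above are included; add the removed ones
    // from your own network log.
    static String buildSearchUrl(int page) {
        return "https://www.blablacar.de/search_xhr"
                + "?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE"
                + "&sort=trip_date&order=asc&limit=10"
                + "&page=" + page;
    }

    // Fetch one page of results as a JSON string.
    static String fetchPage(HttpClient client, int page) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(buildSearchUrl(page)))
                .header("X-Requested-With", "XMLHttpRequest") // mimic the browser's XHR
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (int page = 1; page <= 3; page++) {
            System.out.println(buildSearchUrl(page));
            // String json = fetchPage(client, page); // uncomment to actually fetch
        }
    }
}
```

The actual network call is left commented out so the example runs offline; incrementing page in the loop is all that is needed to walk through successive result pages.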
响应是JSON,为此提供了大量库.
The response is JSON, for which there are a ton of libraries available.
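To show what consuming that JSON might look like, here is a stdlib-only sketch that pulls string fields out of a response with a regular expression. The field names ("trips", "departure", "price") and the response shape are hypothetical; for real use, one of those JSON libraries (e.g. Jackson or Gson) is the safer choice:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TripJsonSketch {

    // Collect every value of a given string-valued field in a JSON document.
    // Demonstration only; a proper JSON parser handles nesting and escaping.
    static List<String> extractField(String json, String field) {
        List<String> values = new ArrayList<>();
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical excerpt of a response; the real field names may differ.
        String json = "{\"trips\":[{\"departure\":\"Frankfurt\",\"price\":\"19 EUR\"},"
                    + "{\"departure\":\"Frankfurt Flughafen\",\"price\":\"21 EUR\"}]}";
        System.out.println(extractField(json, "price")); // [19 EUR, 21 EUR]
    }
}
```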