Making AJAX applications crawlable? How to build a simple web service on Google App Engine to produce HTML snapshots?

Question:

The real-world problem:

I have my app hosted on Heroku, which (to my knowledge) does not offer a way to run a headless (GUI-less) browser, such as HtmlUnit, to generate HTML snapshots for Googlebot to index my AJAX content.

My proposed solution:

If you haven't already, I suggest reading Google's full specification, Making AJAX Applications Crawlable.

Imagine I have:

  • a Sinatra app hosted on Heroku on the domain http://example.com
  • the app has tabs along the top of the page TabA, TabB and TabC
  • under each tab is SubTab1, SubTab2, SubTab3
  • on load, if the URL is http://example.com#!tab=TabA&subtab=SubTab3, client-side JavaScript reads location.hash and loads the TabA, SubTab3 content via AJAX.

Note: the hashbang (#!) is part of the Google specification.

I would like to build a simple "web service" hosted on Google App Engine (GAE) that:

  1. Accepts a URL param e.g. http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (url param should be URLEncoded)
  2. Runs HtmlUnit to open http://example.com#!tab=TabA&subtab=SubTab3 and run the client-side JavaScript on the server.
  3. HtmlUnit returns the DOM once everything is complete (or once something like 45 seconds has passed).
  4. The returned content could be sent back via JSON/JSONP, or alternatively a URL is returned to a file generated and stored on the Google App Engine server (for file-based "cached" results)... open to suggestions here. If a URL to a file were returned, you could then use cURL to get the source code (aka an HTML snapshot).
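
Step 1 says the url param should be URL-encoded. A minimal Python sketch of why, using only the standard library: without encoding, everything after the # is treated as the fragment of the service URL itself and never reaches the query string. The function names are hypothetical, and htmlsnapshot.appspot.com is simply the example service named above, not a real endpoint.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Hypothetical service address, taken from the example in the question.
SNAPSHOT_SERVICE = "http://htmlsnapshot.appspot.com/"


def build_snapshot_request(target_url: str) -> str:
    """Caller side: percent-encode the whole target URL into ?url=."""
    return SNAPSHOT_SERVICE + "?url=" + quote(target_url, safe="")


def read_target_url(request_url: str) -> str:
    """Service side: recover the target URL from the ?url= parameter."""
    query = urlsplit(request_url).query
    return parse_qs(query)["url"][0]


target = "http://example.com#!tab=TabA&subtab=SubTab3"

# Encoded: the hashbang part survives the round trip.
encoded_request = build_snapshot_request(target)
recovered = read_target_url(encoded_request)

# Unencoded: the "#!..." part becomes the fragment of the service URL,
# so the service only ever sees "http://example.com".
unencoded_request = SNAPSHOT_SERVICE + "?url=" + target
truncated = read_target_url(unencoded_request)
```

Here `recovered` equals the original target, while `truncated` has lost the tab and subtab state.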

My http://example.com app would need to manage the call to http://htmlsnapshot.appspot.com... basically:

  1. Catch Googlebot's call to http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3 (the Googlebot crawler escapes certain characters, e.g. %26 = &).
  2. Send request from the backend to http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (url param should be URLEncoded)
  3. Render the returned HTML Snapshot to the frontend.
  4. Google Indexes the content and we rejoice!
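
Steps 1 and 2 hinge on turning the crawler's _escaped_fragment_ request back into the hashbang URL that a browser would see. A minimal sketch of that mapping in Python (standard library only); the function name is hypothetical, and a real app would do the equivalent in its own backend language.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def hashbang_url(crawler_url: str) -> Optional[str]:
    """Map a ?_escaped_fragment_=... URL back to its #! form.

    Returns None when the request is not a crawler snapshot request.
    """
    parts = urlsplit(crawler_url)
    fragment = None
    remaining = []
    # parse_qsl percent-decodes values, so %26 becomes & again.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "_escaped_fragment_":
            fragment = value
        else:
            remaining.append((key, value))
    if fragment is None:
        return None
    # Keep any other query parameters and restore the hashbang fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(remaining), "!" + fragment)
    )


crawler_request = "http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3"
pretty_url = hashbang_url(crawler_request)
ordinary_request = hashbang_url("http://example.com/")
```

The resulting `pretty_url` is what would be URL-encoded and passed to the snapshot service; `ordinary_request` is None, so normal visitors are served as usual.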

I don't have any experience with Google App Engine or Java or HTMLUnit.

I might be able to figure it out... and will post my results if I do.

Otherwise I feel this is a VERY good opportunity for someone to write a kick-ass blog post that outlines a novice's step-by-step guide to setting up a web service like this.

This will introduce more people to the excellent (and free!) Google App Engine. It will also undoubtedly encourage more people to adopt Google's specs for crawlable AJAX content... something we can all benefit from!

As Google's specification gains more acceptance the "hurdle" of setting up a Headless Browser is going to send many devs Googling for answers! Get in now with an answer for fame and glory! (edit: at the very least I will sing your praises).

Hit me up on twitter @_chrisjacob if you would like to discuss solutions.

Answer:

I have successfully used HtmlUnit on App Engine. My GWT code for this is available in the gwt-platform project; the results I got were similar to those of the HTMLunit-AppEngine test application by Amit Manjhi.

It should be relatively easy to use GWTP's current HtmlUnit support to do exactly what you describe, although you could likely do it in a simpler app. One problem I see is that App Engine requests have a 30-second timeout, so you can't have a page that takes HtmlUnit longer than that to process.
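
Given that limit, the "wait until everything is complete" step needs a deadline shorter than the request timeout. A rough Python sketch of such a bounded wait: `is_ready` is a hypothetical callback standing in for whatever check the headless browser offers, and the 25-second budget is an assumed safety margin under the 30 seconds, not a documented value.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    budget_seconds: float,
    poll_seconds: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready until it returns True or the time budget runs out.

    Returns True if the page became ready in time, False otherwise.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + budget_seconds
    while True:
        if is_ready():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_seconds, remaining))


# Assumed margin: stop at 25 seconds so a response can still be sent
# before the 30-second request timeout.
RENDER_BUDGET_SECONDS = 25.0
```

If the wait returns False, the service could return whatever DOM exists at that point or report an error, depending on how strict the snapshots need to be.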

UPDATE: It's been a while, but I finally closed the long standing issue about making GWT applications crawlable using GWTP. The documentation is not entirely there, but check out the issue: http://code.google.com/p/gwt-platform/issues/detail?id=1