HtmlUnit and fragment identifiers

Question:

I'm currently wondering how to deal with fragment identifiers. A link that I want to grab information from contains a fragment identifier. It seems that HtmlUnit is discarding the "#/db4mj" part of my URL and therefore loading the original URL.

Does anyone know of a way to deal with fragment identifiers? (I can post example code to explain further if need be.)

Edit

Since I wasn't getting many views (and no answers), I'm going to add a bounty. Sorry it's only 50, but I only had 79 to start with.

Edit

Here is the example code, as requested.

Our URL is: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0

If you take a look at the content behind that link, you'll see multiple brushes that contain URLs as well. So my script grabs the URL: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4

As you can see, there is the fragment identifier #/dbwam4. Now I try to grab the content that is on this page, but HtmlUnit still thinks it is on the original URL.

Here is the example code from my script, where it fails on the fragment identifier URL but has no problem with the original URL.

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

page = client.getPage(url)       // URL with fragment identifier

// this element is only on the URL with the fragment identifier, not the original URL
img = page.getByXPath("//*[@id='gmi-ResViewSizer_img']")

I'm expecting to be able to grab certain information from the URL with the fragment identifier, but I am unable to access it at all.

Answer

There is good news and bad news.

First, the good news: HtmlUnit appears to be working just fine.

If you visit the page with the fragment identifier URL in a browser with JavaScript turned off (maybe using Firefox's QuickJava plugin), you will not see the "single brush view" that you want.

So in order to acquire this page, you need to use WebClient with setJavaScriptEnabled set to true.
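As background for why JavaScript matters here: the fragment is a client-side part of the URL and is not included in the HTTP request, so the server returns the same document with or without it. Presumably it is the page's own script that reads "#/dbwam4" and builds the single brush view. The following is a minimal sketch using only java.net.URI; the class and method names (FragmentDemo, withoutFragment, fragmentOf) are made up for illustration and are not part of HtmlUnit.

```java
import java.net.URI;

final class FragmentDemo {

    /** Returns the URL as an HTTP client would request it, i.e. without "#fragment". */
    static String withoutFragment(String url) {
        String fragment = URI.create(url).getRawFragment();
        // Strip the fragment plus the leading '#'; leave the URL untouched if there is none.
        return fragment == null ? url : url.substring(0, url.length() - fragment.length() - 1);
    }

    /** Returns the fragment (the text after '#'), or null if the URL has none. */
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String url = "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4";
        System.out.println("requested from server: " + withoutFragment(url));
        System.out.println("left to the client:    " + fragmentOf(url));
    }
}
```

With JavaScript disabled, nothing acts on the second value, which would explain why the page looks identical to the original URL.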

Now the bad news:

I have not consistently been able to acquire the "single brush view" page using HtmlUnit with JavaScript turned on (I don't know why), although I have been able to acquire the full page on occasion.

The real problem is that the returned HTML is in such a bad state that it defies my attempts to parse it (I tried TagSoup, jsoup, Jaxen, etc.). I therefore suspect that attempting to parse the page using XPath may not work for you.

I would therefore think you need to resort to using regular expressions (which is far from ideal) or even use some variant of String.indexOf("gmi-ResViewSizer_img").
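A rough sketch of the regular-expression fallback, using only java.util.regex. It assumes the element with id gmi-ResViewSizer_img is an img tag that carries a src attribute, which I have not verified against the real page; the class name, method name, and sample HTML are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgSrcExtractor {

    // An <img ...> tag that carries the target id, with the attributes in any order.
    private static final Pattern IMG_TAG = Pattern.compile(
            "<img\\b[^>]*\\bid=[\"']gmi-ResViewSizer_img[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // The quoted value of a src attribute inside a single tag.
    private static final Pattern SRC_ATTR = Pattern.compile(
            "\\bsrc=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the src of the target image, or null if the tag or attribute is missing. */
    static String findImageSrc(String html) {
        Matcher tag = IMG_TAG.matcher(html);
        if (!tag.find()) {
            return null;
        }
        Matcher src = SRC_ATTR.matcher(tag.group());
        return src.find() ? src.group(1) : null;
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: unclosed <p>, unquoted width attribute.
        String html = "<div><p>unclosed <img id=\"gmi-ResViewSizer_img\" "
                + "src=\"http://example.com/brush.png\" width=150></div>";
        System.out.println(findImageSrc(html));
    }
}
```

The HTML string could come from something like page.asXml() once the page has loaded. This only looks at one tag at a time, so it tolerates broken markup elsewhere in the document, but it will break if the site changes how that tag is written.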

I hope this helps.

Edit

I managed to get something that works sporadically. I'm afraid I am not converted to Groovy yet, so it will be in plain old Java.

I haven't looked at the HtmlUnit source, but it is almost as if something in the process of running the save helps the parsing work. Without the save, I seem to get NullPointerExceptions.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.FalsifyingWebConnection;
import java.io.File;
import java.io.IOException;

public class TestProblem {

    public static void main(String[] args) throws IOException {
        WebClient client = new WebClient(BrowserVersion.FIREFOX_3_6);
        client.setJavaScriptEnabled(true);
        client.setCssEnabled(false);
        String url = "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4";
        client.setThrowExceptionOnScriptError(false);
        client.setThrowExceptionOnFailingStatusCode(false);
        client.setWebConnection(new FalsifyingWebConnection(client) {

            @Override
            public WebResponse getResponse(final WebRequest request) throws IOException {
                if ("www.google-analytics.com".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                if ("d.unanimis.co.uk".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                if ("edge.quantserve.com".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                if ("b.scorecardresearch.com".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                //
                if (request.getUrl().toString().startsWith("http://st.deviantart.net/css/v6core_jc.js")) {
                    WebResponse wr = super.getResponse(request);
                    return createWebResponse(request, wr.getContentAsString(), "application/javascript");
                }
                if (request.getUrl().toString().startsWith("http://st.deviantart.net/css/v6loggedin_jc.js")) {
                    WebResponse wr = super.getResponse(request);
                    return createWebResponse(request, wr.getContentAsString(), "application/javascript");
                }
                return super.getResponse(request);
            }
        });

        HtmlPage page = client.getPage(url); // URL with fragment identifier

        // Save the page before the lookup; without this step the lookup below
        // tended to end in a NullPointerException.
        File saveFile = new File("saved.html");
        if (saveFile.exists()) {
            saveFile.delete();
        }
        page.save(saveFile);

        HtmlElement img = page.getElementById("gmi-ResViewSizer_img");
        System.out.println(img.toString());

    }
}
