How to exclude links with POST parameters using wget

Problem description:

I want to download all accessible HTML files under www.site.com/en/. However, the site has many linked URLs with POST parameters in the query string (e.g. pages 1, 2, 3, ... for each product category), and I do NOT want wget to download those links. I'm using

-R "*\?*"

But this is not ideal, because wget only deletes each file after it has already downloaded it.

Is there a way to filter the links that wget follows, for example with a regex?

Yes, you can avoid those files with a regex: use --reject-regex '(.*)\?(.*)'. This option is only available in newer wget releases (1.15 or later should have it), so I recommend checking your wget version first.
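
As a rough sketch of how this could be tried out before starting a crawl: the snippet below previews which URLs the pattern would reject, using grep -E as a stand-in for wget's default POSIX regex matching. The would_reject helper, the two sample URLs, and the http:// scheme are assumptions made for illustration; they are not part of wget or of the original question.

```shell
#!/bin/sh
# Pattern from the answer: matches any URL that contains a literal "?".
PATTERN='(.*)\?(.*)'

# Hypothetical helper (not part of wget): previews the filter with grep -E,
# which uses POSIX extended regexes, as wget's default regex type does.
would_reject() {
    if printf '%s\n' "$1" | grep -Eq "$PATTERN"; then
        echo "reject"
    else
        echo "follow"
    fi
}

# Sample URLs are made up for illustration.
would_reject 'http://www.site.com/en/shoes.html'          # prints "follow"
would_reject 'http://www.site.com/en/shoes.html?page=2'   # prints "reject"

# Check the installed version first:
#   wget --version | head -n 1
# The crawl itself (left commented out so this sketch does no network access):
#   wget --recursive --no-parent --reject-regex "$PATTERN" http://www.site.com/en/
```

Unlike -R, the regex is applied to the full URL before the link is followed, so the rejected pages should not be downloaded at all.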