Python urllib over TOR?
Sample code:
#!/usr/bin/python
import socks
import socket
import urllib2
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, "127.0.0.1", 9050, True)
socket.socket = socks.socksocket
print urllib2.urlopen("http://almien.co.uk/m/tools/net/ip/").read()
TOR is running a SOCKS proxy on port 9050 (its default). The request goes through TOR and surfaces at an IP address other than my own. However, the TOR console gives this warning:
"Feb 28 22:44:26.233 [warn] Your application (using socks4 to port 80) is giving Tor only an IP address. Applications that do DNS resolves themselves may leak information. Consider using Socks4A (e.g. via privoxy or socat) instead. For more information, please see https://wiki.torproject.org/TheOnionRouter/TorFAQ#SOCKSAndDNS."
That is, DNS lookups aren't going through the proxy. But that's what the 4th parameter to setdefaultproxy is supposed to do, right?
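For context on the warning, here is a rough sketch of how a SOCKS4 CONNECT request differs from a SOCKS4a one on the wire. It is not taken from SocksiPy; the helper names are hypothetical and the layout follows the published SOCKS4/SOCKS4a descriptions. With plain SOCKS4 the client must already hold an IP address, so it resolves the name itself. With SOCKS4a the client sends a placeholder address of the form 0.0.0.x and appends the hostname for the proxy to resolve.

```python
import socket
import struct


def socks4_connect_request(ip, port, user_id=b""):
    """Build a SOCKS4 CONNECT request (hypothetical helper).

    The destination must already be an IPv4 address, which means the
    client has done the DNS lookup itself.
    """
    # version 4, command 1 (CONNECT), 2-byte port, 4-byte IP, user id, NUL
    return struct.pack(">BBH", 4, 1, port) + socket.inet_aton(ip) + user_id + b"\x00"


def socks4a_connect_request(hostname, port, user_id=b""):
    """Build a SOCKS4a CONNECT request (hypothetical helper).

    The IP field holds the placeholder 0.0.0.1 and the hostname follows
    the user id, so the proxy performs the DNS lookup.
    """
    placeholder_ip = b"\x00\x00\x00\x01"
    return (struct.pack(">BBH", 4, 1, port) + placeholder_ip + user_id + b"\x00"
            + hostname.encode("ascii") + b"\x00")


# SOCKS4: only an address reaches the proxy.
plain = socks4_connect_request("192.0.2.1", 80)
# SOCKS4a: the hostname itself reaches the proxy.
remote_dns = socks4a_connect_request("example.com", 80)
```

Only the second request carries the hostname, which is why Tor complains when it receives the first form.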
From http://socksipy.sourceforge.net/readme.txt:
setproxy(proxytype, addr[, port[, rdns[, username[, password]]]])
rdns - This is a boolean flag that modifies the behavior regarding DNS resolving. If it is set to True, DNS resolving will be performed remotely, on the server.
The effect is the same with both PROXY_TYPE_SOCKS4 and PROXY_TYPE_SOCKS5 selected.
It can't be a local DNS cache (if urllib2 even supports that), because it also happens when I change the URL to a domain that this computer has never visited before.
The problem is that httplib.HTTPConnection uses the socket module's create_connection helper function, which performs the DNS lookup through the usual getaddrinfo call before connecting the socket.
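One way to see this for yourself, using only the standard library, is to swap socket.getaddrinfo for a recording stub and call socket.create_connection. This is a minimal sketch; the function name lookups_made_by_create_connection is made up for illustration. The stub raises before any real lookup or connection happens, so nothing leaves the machine.

```python
import socket


def lookups_made_by_create_connection(host, port):
    """Return the (host, port) pairs that create_connection tries to resolve locally."""
    seen = []
    original = socket.getaddrinfo

    def recording_getaddrinfo(h, p, *args, **kwargs):
        seen.append((h, p))
        # Abort here so that no real DNS query or connection is made.
        raise socket.gaierror("lookup intercepted for demonstration")

    socket.getaddrinfo = recording_getaddrinfo
    try:
        socket.create_connection((host, port))
    except socket.error:
        # The stub's gaierror propagates out of create_connection.
        pass
    finally:
        socket.getaddrinfo = original
    return seen


recorded = lookups_made_by_create_connection("example.invalid", 80)
```

If recorded contains the hostname, create_connection asked the local resolver before any socket object had a chance to hand the name to a proxy.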
The solution is to write your own create_connection function and monkey-patch it into the socket module before importing urllib2, just as we do with the socket class.
import socks
import socket
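def create_connection(address, timeout=None, source_address=None):
    # timeout and source_address are accepted only to match the signature of
    # socket.create_connection; this minimal version ignores them.
    sock = socks.socksocket()
    # Hand the (hostname, port) pair to the SOCKS socket unresolved, so no
    # local getaddrinfo call is made here.
    sock.connect(address)
    return sock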
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
# patch the socket module
socket.socket = socks.socksocket
socket.create_connection = create_connection
import urllib2
# Now you can go ahead and scrape those shady darknet .onion sites
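To check that a patch of this kind really hands the hostname over unresolved, you can swap in a stub for the proxy socket class. The sketch below runs without SocksiPy or Tor; RecordingSocket, make_create_connection and patched_create_connection are hypothetical names, and RecordingSocket merely stands in for socks.socksocket. The context manager also restores the original function afterwards, which the global patch above does not do.

```python
import contextlib
import socket


class RecordingSocket(object):
    """Hypothetical stand-in for socks.socksocket that records connect() calls."""
    connected_to = []

    def connect(self, address):
        RecordingSocket.connected_to.append(address)


def make_create_connection(socket_factory):
    """Build a create_connection replacement around the given socket class."""
    def create_connection(address, timeout=None, source_address=None):
        sock = socket_factory()
        # The address is passed through as given, with no local DNS lookup.
        sock.connect(address)
        return sock
    return create_connection


@contextlib.contextmanager
def patched_create_connection(socket_factory):
    """Temporarily replace socket.create_connection, restoring it on exit."""
    original = socket.create_connection
    socket.create_connection = make_create_connection(socket_factory)
    try:
        yield
    finally:
        socket.create_connection = original


with patched_create_connection(RecordingSocket):
    socket.create_connection(("example.invalid", 80))
```

After the with block, RecordingSocket.connected_to holds the hostname exactly as it was passed in, which is what a SOCKS socket needs in order to let the proxy do the resolving.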