IP address returned by gethostbyname cannot be used for access: 403 Forbidden

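I'm writing a crawler with raw sockets. Requests that use the domain name plus the path work, but when I replace the domain name with the IP address returned by gethostbyname, the request fails with 403 Forbidden. Has the site put some security measure in place? Since a browser can load the page, how can I get the crawler to load it too?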
------Solution----------------------
You need to set the Host header field in the request. See:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

Quote:

14.23 Host

The Host request-header field specifies the Internet host and port number of the resource being requested, as obtained from the original URI given by the user or referring resource (generally an HTTP URL, as described in section 3.2.2). The Host field value MUST represent the naming authority of the origin server or gateway given by the original URL. This allows the origin server or gateway to differentiate between internally-ambiguous URLs, such as the root "/" URL of a server for multiple host names on a single IP address.

       Host = "Host" ":" host [ ":" port ] ; Section 3.2.2

A "host" without any trailing port information implies the default port for the service requested (e.g., "80" for an HTTP URL). For example, a request on the origin server for <http://www.w3.org/pub/WWW/> would properly include:

       GET /pub/WWW/ HTTP/1.1
       Host: www.w3.org

A client MUST include a Host header field in all HTTP/1.1 request messages. If the requested URI does not include an Internet host name for the service being requested, then the Host header field MUST be given with an empty value. An HTTP/1.1 proxy MUST ensure that any request message it forwards does contain an appropriate Host header field that identifies the service being requested by the proxy. All Internet-based HTTP/1.1 servers MUST respond with a 400 (Bad Request) status code to any HTTP/1.1 request message which lacks a Host header field.

See sections 5.2 and 19.6.1.1 for other requirements relating to Host.
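
To illustrate the "multiple host names on a single IP address" case from the quote, here is a rough sketch of how a server might pick a site from the Host header. It is not how any particular server is implemented: the `SITES` table, the `select_site` function, and the choice to answer 403 for an unknown name are all assumptions made for illustration.

```python
# Hypothetical table of sites that share one IP address (names are made up).
SITES = {
    "www.example.com": "site A",
    "blog.example.com": "site B",
}


def select_site(headers):
    """Return (status, body) for a request, based only on its Host header.

    `headers` is a dict with lower-cased header names.
    """
    host = headers.get("host", "")
    if not host:
        # RFC 2616 14.23: an HTTP/1.1 request without Host gets 400.
        return 400, "Bad Request"
    # Drop an optional ":port" suffix (IPv6 literals are not handled here).
    name = host.split(":", 1)[0].lower()
    if name in SITES:
        return 200, SITES[name]
    # Assumed policy: a name the server does not host, such as a bare IP
    # address, is refused. Real servers may serve a default site instead.
    return 403, "Forbidden"
```

Under these assumptions, a request carrying `Host: 192.0.2.10` is refused even though the TCP connection went to the right machine, which matches the symptom in the question.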

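------Solution----------------------
The reply above is correct. If you connect by IP address, just add a Host header line to the HTTP request headers, carrying the original domain name.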
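
A minimal Python sketch of that fix, assuming plain HTTP on port 80: connect to the IP address from gethostbyname, but send the original domain name in the Host header. The helper names `build_request` and `fetch` are made up for this example, and the code does not handle HTTPS, redirects, or chunked responses.

```python
import socket


def build_request(host, path="/", port=80):
    """Build a minimal HTTP/1.1 GET request that names the original host."""
    # Per RFC 2616 14.23, the port may be omitted when it is the default.
    host_value = host if port == 80 else f"{host}:{port}"
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host_value}",   # the domain name, not the resolved IP
        "Connection: close",     # ask the server to close so recv() reaches EOF
        "",
        "",                      # blank line that ends the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path="/", port=80):
    """Resolve `host`, connect to the IP address, and return the raw response."""
    ip = socket.gethostbyname(host)
    with socket.create_connection((ip, port), timeout=10) as sock:
        sock.sendall(build_request(host, path, port))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

For the example in the quoted RFC text, `build_request("www.w3.org", "/pub/WWW/")` produces a request whose second line is `Host: www.w3.org`; calling `fetch("www.w3.org", "/pub/WWW/")` would send it to the resolved IP address.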
