Original article: http://blog.csdn.net/yueguanghaidao/article/details/11994355
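I mostly use Python to write web vulnerability detection scripts, so I rely heavily on networking libraries such as urllib, urllib2, and httplib2. These libraries are all built on top of httplib, so how do they differ? The only way to find out is to read the source, and to read their source you first need to be familiar with the base library, httplib.
Within minutes of starting I picked up something new about Exception. The source reads: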
    class HTTPException(Exception):
        # Subclasses that define an __init__ must call Exception.__init__
        # or define self.args.  Otherwise, str() will fail.
        pass

    class IncompleteRead(HTTPException):
        def __init__(self, partial):
            self.args = partial,
            self.partial = partial
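The comment says that an Exception subclass that defines __init__ must either call Exception.__init__ or set self.args itself; otherwise str() will fail. What does that actually mean?
From the code above we can infer that args is an attribute of Exception (not a method) and that it should be a tuple (think *args and **kwargs), so I tested it.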

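As expected, Exception's __str__ simply renders self.args: with a single element it returns the string form of that element, and with several it returns the string form of the whole tuple.
With that, the comment makes sense. You can also define your own __str__ method, in which case the rule no longer applies.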
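The behaviour described above can be checked with a short script. This is a minimal sketch written for Python 3 (an assumption, since the article targets Python 2), and the class names PartialError and CustomStr are made up for illustration.

```python
class PartialError(Exception):
    """Hypothetical subclass that sets self.args itself, like IncompleteRead."""

    def __init__(self, partial):
        self.args = (partial,)   # same as the source's `self.args = partial,`
        self.partial = partial


class CustomStr(Exception):
    """Hypothetical subclass that overrides __str__ instead."""

    def __init__(self, partial):
        self.partial = partial

    def __str__(self):
        return "got %d bytes" % len(self.partial)


one = PartialError("abc")
two = Exception("a", "b")
custom = CustomStr("abc")

print(one.args, str(one))   # args is a tuple; str() uses its single element
print(two.args, str(two))   # several args: str() shows the whole tuple
print(str(custom))          # the custom __str__ does not look at args
```

The trailing comma in `self.args = partial,` is what makes the value a one-element tuple.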
The httplib module has several important classes.
HTTP: a wrapper around HTTPConnection, HTTPResponse, and so on, offering a friendlier interface. Here is a usage example:
    from httplib import *

    host = 'www.python.org'
    selector = '/'
    h = HTTP()
    h.set_debuglevel(1)
    h.connect(host)
    h.putrequest('GET', selector)
    h.putheader('User-Agent','skycrab')
    h.endheaders()
    status, reason, headers = h.getreply()
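The code is simple. Note that a set_debuglevel value greater than 0 enables debug mode, which prints connection information, the headers being sent, and so on.
Note also that whether or not you send any headers, you must call endheaders to signal that the request can be sent; otherwise an exception is raised. A bit tedious, right? HTTPConnection actually provides a method that sends the headers automatically, but the HTTP class does not wrap it, so it is not available there.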
    from httplib import *

    host = 'www.python.org'
    selector = '/'
    h = HTTPConnection(host)
    h.set_debuglevel(1)
    h.request('GET', selector)
    r=h.getresponse()
    print r.getheaders(),r.status
As you can see, request does not need a separate endheaders call.
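To compare the two styles without touching the network, the sketch below captures the bytes each one would send. It assumes Python 3, where httplib is named http.client. RecordingSocket, capture, step_by_step, and one_call are hypothetical helpers, and example.invalid is a placeholder host that is never contacted.

```python
import http.client


class RecordingSocket:
    """Hypothetical stand-in for a socket: records bytes instead of sending."""

    def __init__(self):
        self.sent = b""

    def sendall(self, data):
        self.sent += data

    def close(self):
        pass


def capture(build_request):
    """Run build_request on an offline connection and return the raw bytes."""
    conn = http.client.HTTPConnection("example.invalid")
    conn.sock = RecordingSocket()   # pre-set sock so no real connect() happens
    build_request(conn)
    return conn.sock.sent


def step_by_step(conn):
    conn.putrequest("GET", "/")
    conn.putheader("User-Agent", "skycrab")
    conn.endheaders()               # the buffered request is sent here


def one_call(conn):
    conn.request("GET", "/", headers={"User-Agent": "skycrab"})


manual = capture(step_by_step)
auto = capture(one_call)
print(manual.decode("ascii"))
print(manual == auto)
```

Passing a headers dict to request is also a way to set User-Agent per request without editing any library source.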
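By default httplib does not send headers such as User-Agent, and the only Accept-Encoding it supports is identity, that is, uncompressed content (chunked transfer is supported, though). urllib, httplib2, and similar libraries generally use their own version string as the User-Agent. If needed, you can change this by editing the source directly; otherwise your requests are easy to fingerprint and block.
Take urllib, for example: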
    version = "Python-urllib/%s" % __version__

    # Constructor
    def __init__(self, proxies=None, **x509):
        if proxies is None:
            proxies = getproxies()
        assert hasattr(proxies, 'has_key'), "proxies must be a mapping"
        self.proxies = proxies
        self.key_file = x509.get('key_file')
        self.cert_file = x509.get('cert_file')
        self.addheaders = [('User-Agent', self.version)]
        self.__tempfiles = []
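As an alternative to editing library source, the default User-Agent can usually be replaced on the object that carries it. The sketch below is an assumption-laden illustration using Python 3's urllib.request (not the Python 2 urllib class quoted above); it only inspects and changes the opener's addheaders list and makes no request.

```python
import urllib.request

opener = urllib.request.build_opener()

# The opener starts with a single default header naming the library version.
default_headers = list(opener.addheaders)

# Replace the default User-Agent on this opener only.
opener.addheaders = [("User-Agent", "skycrab")]

print(default_headers)
print(opener.addheaders)
```

Requests made through this opener's open method would then carry the replacement header, while other openers keep the default.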
Now let's look at HTTPConnection's connect method, which may differ a little from what you would expect.
    def connect(self):
        """Connect to the host and port specified in __init__."""
        msg = "getaddrinfo returns an empty list"
        for res in socket.getaddrinfo(self.host, self.port, 0,
                                      socket.SOCK_STREAM):
            af, socktype, proto, canonname, sa = res
            try:
                self.sock = socket.socket(af, socktype, proto)
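Notice the loop. A single hostname may resolve to more than one address (a site can sit behind several servers), so connect walks through every result that getaddrinfo returns.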
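The same try-each-address pattern can be sketched as a standalone function. This is an illustration in Python 3, not httplib's actual code: connect_first and its timeout parameter are hypothetical, and the demo connects to a local listener so it needs no network access.

```python
import socket


def connect_first(host, port, timeout=5.0):
    """Try each address returned by getaddrinfo until one connects."""
    last_error = OSError("getaddrinfo returns an empty list")
    for res in socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sa)
            return sock             # the first address that works wins
        except OSError as exc:
            last_error = exc        # remember the failure, try the next one
            if sock is not None:
                sock.close()
    raise last_error


# Local listener so the demo stays on this machine.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))     # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

client = connect_first("127.0.0.1", port)
peer = client.getpeername()
print(peer)

client.close()
listener.close()
```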

HTTPConnection uses the HTTPResponse class to read the returned data. In the getresponse method:
    response = self.response_class(self.sock, strict=self.strict,
                                   method=self._method)
    response.begin()
begin starts collecting the response data. Let's see how the status at the very beginning is obtained:
    self.fp = sock.makefile('rb', 0)
    line = self.fp.readline()
    [version, status, reason] = line.split(None, 2)
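This is code I have simplified. The sock above is the self.sock passed in when the response was created. sock.makefile returns an ordinary file object associated with the socket, so we can call file methods such as read and readline. This is done because the headers come back line by line, which makes a plain readline() the simplest way to read them.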
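The makefile and readline steps can be tried in isolation. The sketch below assumes Python 3 and uses socket.socketpair as a stand-in for a real client/server connection; the canned status line is made up for the demo.

```python
import socket

# A connected pair of sockets stands in for server and client.
server_side, client_side = socket.socketpair()
server_side.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")

fp = client_side.makefile("rb")    # file object wrapping the socket
line = fp.readline()               # first line of the response
version, status, reason = line.split(None, 2)
reason = reason.strip()            # the last field still carries the CRLF

print(version, int(status), reason)

fp.close()
client_side.close()
server_side.close()
```

Because split is limited to two splits, a multi-word reason phrase such as "Not Found" stays in one piece.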
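When the body is read, the transfer-encoding response header is checked first to see whether the body uses chunked encoding. If it does, the chunked reading path is used; otherwise the body is read according to the length value.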
    def read(self, amt=None):
        if self.chunked:
            return self._read_chunked(amt)

        if amt is None:
            if self.length is None:
                s = self.fp.read()
            else:
                s = self._safe_read(self.length)
                self.length = 0
            self.close()        # we read everything
            return s
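self.length is the Content-Length from the response headers. You may wonder why the fixed-length case does not simply do a plain read of self.length bytes. The reason is a few special cases, such as the Content-Length header being larger than the body actually returned, or the read being interrupted by a signal. The earlier branch does not use _safe_read because the amount to read is unknown and hard to control, so it simply reads everything.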
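Both branches of read can be exercised offline by feeding canned bytes to the response class. This sketch assumes Python 3's http.client.HTTPResponse; FakeSocket and parse are hypothetical helpers, and the two raw responses are made up for the demo.

```python
import http.client
import io


class FakeSocket:
    """Hypothetical socket whose makefile() replays canned response bytes."""

    def __init__(self, data):
        self.data = data

    def makefile(self, *args, **kwargs):
        return io.BytesIO(self.data)


def parse(raw):
    """Parse raw bytes and return (is_chunked, length, body)."""
    resp = http.client.HTTPResponse(FakeSocket(raw))
    resp.begin()                    # reads the status line and headers
    is_chunked, length = bool(resp.chunked), resp.length
    return is_chunked, length, resp.read()


fixed = (b"HTTP/1.1 200 OK\r\n"
         b"Content-Length: 11\r\n\r\n"
         b"hello world")
chunked = (b"HTTP/1.1 200 OK\r\n"
           b"Transfer-Encoding: chunked\r\n\r\n"
           b"5\r\nhello\r\n"
           b"6\r\n world\r\n"
           b"0\r\n\r\n")

fixed_result = parse(fixed)
chunked_result = parse(chunked)
print(fixed_result)
print(chunked_result)
```

The chunked response has no usable length up front, which is why it needs its own reading path.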
It is worth looking at the _safe_read method here.
    def _safe_read(self, amt):
        """Read the number of bytes requested, compensating for partial reads.

        Normally, we have a blocking socket, but a read() can be interrupted
        by a signal (resulting in a partial read).

        Note that we cannot distinguish between EOF and an interrupt when zero
        bytes have been read. IncompleteRead() will be raised in this
        situation.

        This function should be used when bytes "should" be present for
        reading. If the bytes are truly not available (due to EOF), then the
        IncompleteRead exception can be used to detect the problem.
        """
        s = []
        while amt > 0:
            chunk = self.fp.read(min(amt, MAXAMOUNT))
            if not chunk:
                raise IncompleteRead(s)
            s.append(chunk)
            amt -= len(chunk)
        return ''.join(s)
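A word about the IncompleteRead exception. A few days ago I ran into a website protected by our company's WAF. A custom rule was triggered, which made the returned Content-Length larger than the actual page content, so IncompleteRead was raised.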
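The situation can be reproduced offline with a response whose Content-Length overstates the body. This sketch assumes Python 3's http.client, where the partial attribute holds bytes; FakeSocket is a hypothetical helper and the raw response is made up for the demo.

```python
import http.client
import io


class FakeSocket:
    """Hypothetical socket whose makefile() replays canned response bytes."""

    def __init__(self, data):
        self.data = data

    def makefile(self, *args, **kwargs):
        return io.BytesIO(self.data)


# Content-Length claims 20 bytes, but only 5 bytes of body follow.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Length: 20\r\n\r\n"
       b"hello")

resp = http.client.HTTPResponse(FakeSocket(raw))
resp.begin()
try:
    resp.read()
    partial = None
except http.client.IncompleteRead as exc:
    partial = exc.partial           # the bytes that did arrive

print(partial)
```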
There are two ways to fix this:
1. Modify the _safe_read method so that it returns the data instead of raising the exception
    s = []
    while amt > 0:
        chunk = self.fp.read(min(amt, MAXAMOUNT))
        if not chunk:
            #raise IncompleteRead(s)
            return ''.join(s)
        s.append(chunk)
        amt -= len(chunk)
    return ''.join(s)
2. Apply a patch with a decorator
    def patch_http_response_read(func):
        def inner(*args):
            try:
                return func(*args)
            except IncompleteRead, e:
                return ''.join(e.partial)
        return inner

    HTTPResponse.read = patch_http_response_read(HTTPResponse.read)
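Put the code above at the end of httplib and you are done. Note that the partial attribute of IncompleteRead is a list, because s in _safe_read is [].
Personally I recommend the second approach.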
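For comparison, here is a sketch of the decorator idea in Python 3, under two assumptions that differ from the article: partial is already bytes in http.client (so no join is needed), and the wrapper is applied to a single response object rather than patched into the library. tolerate_incomplete and FakeSocket are hypothetical names.

```python
import functools
import http.client
import io


def tolerate_incomplete(read):
    """Wrap a read callable so a short body is returned instead of raised."""
    @functools.wraps(read)
    def inner(*args, **kwargs):
        try:
            return read(*args, **kwargs)
        except http.client.IncompleteRead as exc:
            return exc.partial      # whatever arrived before the stream ended
    return inner


class FakeSocket:
    """Hypothetical socket whose makefile() replays canned response bytes."""

    def __init__(self, data):
        self.data = data

    def makefile(self, *args, **kwargs):
        return io.BytesIO(self.data)


raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Length: 20\r\n\r\n"
       b"hello")

resp = http.client.HTTPResponse(FakeSocket(raw))
resp.begin()
resp.read = tolerate_incomplete(resp.read)   # affects this one object only
body = resp.read()
print(body)
```

Wrapping one object keeps the change local, whereas assigning to HTTPResponse.read changes every response in the process.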
HTTPConnection moves through a series of states. Look at the following code:
    import httplib
    reload(httplib)
    import time
    httplib.HTTPConnection.debuglevel=0
    c=httplib.HTTPConnection('www.baidu.com')
    c.putrequest('GET','/')
    c.endheaders()
    r1=c.getresponse()
    #c1=r1.read()
    print r1.getheaders()
    time.sleep(3)
    c.putrequest('GET','/zhidao')
    c.endheaders()
    r2=c.getresponse()
    print r2.getheaders()
This raises an exception, with the following traceback:
    Traceback (most recent call last):
      File "C:\Python\Python25\Lib\SITE-P~1\PYTHON~1\pywin\framework\scriptutils.py", line 310, in RunScript
        exec codeObject in __main__.__dict__
      File "C:\Users\user\Desktop\py\cs.py", line 14, in
        r2=c.getresponse()
      File "C:\Python\Python25\lib\httplib.py", line 918, in getresponse
        raise ResponseNotReady()
    ResponseNotReady
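Uncomment the c1=r1.read() line and everything works. The point to note is that the previous response must be read before the next request can go through (the failure surfaces in getresponse). The source explains it this way: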
    # if a prior response exists, then it must be completed (otherwise, we
    # cannot read this response's header to determine the connection-close
    # behavior)
    #
    # note: if a prior response existed, but was connection-close, then the
    # socket and response were made independent of this HTTPConnection
    # object since a new request requires that we open a whole new
    # connection
    #
    # this means the prior response had one of two states:
    #   1) will_close: this connection was reset and the prior socket and
    #                  response operate independently
    #   2) persistent: the response was retained and we await its
    #                  isclosed() status to become true.
    #
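The main reason is that until the previous response has been read, there is no way to know whether the server has closed the connection (the Connection header). If the server has closed it, we need to reconnect. HTTPConnection controls this through the class variable auto_open: when it is 1, which is the default, the connection is reopened automatically.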
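The state check can be reproduced without a real server. The sketch below assumes Python 3's http.client, which raises the same ResponseNotReady in this situation; FakeSocket and CANNED are hypothetical, and example.invalid is a placeholder host that is never contacted.

```python
import http.client
import io

CANNED = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Length: 5\r\n\r\n"
          b"hello")


class FakeSocket:
    """Hypothetical socket: swallows requests, replays CANNED per response."""

    def sendall(self, data):
        pass

    def makefile(self, *args, **kwargs):
        return io.BytesIO(CANNED)

    def close(self):
        pass


conn = http.client.HTTPConnection("example.invalid")
conn.sock = FakeSocket()            # pre-set sock so nothing touches the network

conn.request("GET", "/")
r1 = conn.getresponse()             # headers parsed, body not read yet

conn.request("GET", "/zhidao")
try:
    conn.getresponse()
    outcome = "ok"
except http.client.ResponseNotReady:
    outcome = "ResponseNotReady"    # r1 is still pending on this connection

r1.read()                           # finishing r1 releases the connection
r2 = conn.getresponse()
second_status = r2.status
print(outcome, second_status)
```

Once r1 has been read to the end it reports itself as closed, and the connection is willing to hand out the next response.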
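(PS: Reading source code really is not easy. This post took me a long time to write, but I got a great deal out of it, and I strongly recommend that you read more source code too.)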