Python memory leak using binascii, zlib, struct, and numpy
I have a Python script that processes a large amount of data from compressed ASCII. After a short period, it runs out of memory. I am not constructing large lists or dicts. The following code illustrates the issue:
import struct
import zlib
import binascii
import numpy as np
import psutil
import os
import gc
process = psutil.Process(os.getpid())
n = 1000000
compressed_data = binascii.b2a_base64(bytearray(zlib.compress(struct.pack('%dB' % n, *np.random.random(n))))).rstrip()
print 'Memory before entering the loop is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))
for i in xrange(2):
    print 'Memory before iteration %d is %d MB' % (i, process.get_memory_info()[0] / float(2 ** 20))
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    a = np.array(struct.unpack('%dB' % (len(byte_array)), byte_array))
    gc.collect()
gc.collect()
print 'Memory after last iteration is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))
It prints:
Memory before entering the loop is 45 MB
Memory before iteration 0 is 45 MB
Memory before iteration 1 is 51 MB
Memory after last iteration is 51 MB
Between the first and second iterations, memory usage grows by 6 MB. If I run the loop more than two times, the memory usage stays at 51 MB. If I put the decompression code into its own function and feed it the actual compressed data, the memory usage continues to grow. I am using Python 2.7. Why is the memory increasing, and how can it be corrected? Thank you.
Through comments, we figured out what was going on:
The main issue is that variables assigned in a for loop are not destroyed once the loop ends. They remain accessible, pointing to the value they received in the last iteration:
>>> for i in range(5):
...     a=i
...
>>> print a
4
So here's what's happening:
- First iteration: The print shows 45 MB, which is the memory in use before byte_array and a are instantiated.
- The code instantiates those two large variables, taking memory usage to 51 MB.
- Second iteration: The two variables instantiated in the first run of the loop are still there.
- In the middle of the second iteration, byte_array and a are overwritten by the new instantiation. The initial objects are destroyed, but replaced by equally large ones.
- The for loop ends, but byte_array and a are still accessible in the code and are therefore not destroyed by the second gc.collect() call.
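The sequence above can be reproduced in a small sketch without numpy or psutil. It assumes CPython's reference counting, and Payload and run_loop are hypothetical names introduced only for this illustration. Weak references are used to observe which objects are still alive without keeping them alive:

```python
import gc
import weakref


class Payload(object):
    """Hypothetical stand-in for a large buffer (a plain class supports weak references)."""
    def __init__(self, size):
        self.data = bytearray(size)


def run_loop(iterations, size):
    """Create one Payload per iteration and report which ones survive."""
    refs = []
    for _ in range(iterations):
        payload = Payload(size)            # rebinding releases the previous Payload
        refs.append(weakref.ref(payload))  # a weak reference does not keep it alive
    # The name `payload` still refers to the object from the final iteration.
    alive_after_loop = [r() is not None for r in refs]
    payload = None                         # drop the last remaining reference
    gc.collect()
    alive_after_clear = [r() is not None for r in refs]
    return alive_after_loop, alive_after_clear


after_loop, after_clear = run_loop(3, 1024)
print(after_loop)   # [False, False, True]
print(after_clear)  # [False, False, False]
```

Only the object from the last iteration outlives the loop, which matches the one-time jump in the question rather than unbounded growth.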
Changing the code to:
for i in xrange(2):
    [ . . . ]
byte_array = None
a = None
gc.collect()
made the memory reserved by byte_array and a inaccessible, and therefore freed.
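One way to check this effect is to measure allocations before and after releasing the names. The sketch below is not the code from the question: it assumes Python 3.4 or later (for the standard-library tracemalloc module), uses a payload of zero bytes as a stand-in for the real data, skips numpy, and uses del, which unbinds the names much like assigning None does.

```python
import binascii
import struct
import tracemalloc
import zlib

n = 100000
# Stand-in payload (an assumption): n zero bytes, compressed and base64-encoded.
compressed_data = binascii.b2a_base64(zlib.compress(bytes(bytearray(n)))).rstrip()

tracemalloc.start()
for i in range(2):
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    values = struct.unpack('%dB' % len(byte_array), byte_array)

# Both names are still bound here, so their objects are still allocated.
held, _ = tracemalloc.get_traced_memory()
del byte_array, values                     # unbind the names, releasing the objects
released, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print('held: %d KB, after del: %d KB' % (held // 1024, released // 1024))
```

tracemalloc reports memory allocated by Python itself, so the figures will not match the process-level numbers that psutil prints in the question.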
There's more on Python's garbage collection in this SO answer: https://stackoverflow.com/a/4484312/289011
Also, it may be worth looking at "How do I determine the size of an object in Python?". This is tricky, though... if your object is a list pointing to other objects, what is the size? The sum of the pointers in the list? The sum of the sizes of the objects those pointers point to?
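The two readings of "size" can be compared directly. In the sketch below, sys.getsizeof is the standard-library call and reports only the container itself, while deep_sizeof is a hypothetical helper written for this illustration. It handles only a few built-in container types and is not a general-purpose measurement tool.

```python
import sys


def deep_sizeof(obj, seen=None):
    """Hypothetical helper: sum sys.getsizeof over obj and the objects it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:                    # count shared objects only once
        return 0
    seen.add(id(obj))
    total = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            total += deep_sizeof(key, seen) + deep_sizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            total += deep_sizeof(item, seen)
    return total


rows = [bytearray(1000) for _ in range(10)]
shallow = sys.getsizeof(rows)              # the list object and its pointers only
deep = deep_sizeof(rows)                   # the list plus the ten bytearrays it refers to
print(shallow < deep)
```

Which figure is the right one depends on whether the referenced objects are shared with other containers, which is why a single "size" is hard to define.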