debug实战:Unmanaged High Memory非托管高内存

最近又监控到一个高内存的问题,周五下班把系统打开,周末2天没关,周一来看已经涨到5.2G,这次与以往不同,不是.net的内存泄漏,而是非托管引起的。

1. 抓dump,确定高内存的类型

//dump有5.2G,.net的堆只有191M,可以判断是unmanaged high memory
0:000> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000000159852b88
generation 1 starts at 0x00000001597d26a0
generation 2 starts at 0x0000000003231000
ephemeral segment allocation context: none
 segment     begin allocated  size
0000000003230000  0000000003231000  000000000dad5f38  0xa8a4f38(176836408)
00000001596b0000  00000001596b1000  0000000159db48f8  0x7038f8(7354616)
Large object heap starts at 0x0000000013231000
 segment     begin allocated  size
0000000013230000  0000000013231000  0000000013906b30  0x6d5b30(7166768)
Total Size:              Size: 0xb67e360 (191357792) bytes.
------------------------------
GC Heap Size:            Size: 0xb67e360 (191357792) bytes.

2. 用windbg找原因

于是不能用sos ext里的命令了,用CodeProject上的这篇文章里的方法试了一下,如下:

//这个命令能看到所有的内存消耗,确实分配了5G多
0:000> !address -summary	                                     
Mapping file section regions...
Mapping module regions...
Mapping PEB regions...
Mapping TEB and stack regions...
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for DebussyReverie.dll - 
Mapping heap regions...
Mapping page heap regions...
Mapping other regions...
Mapping stack trace database regions...
Mapping activation context regions...
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free                                    412      7fe`897c2000 (   7.994 Tb)           99.93%
Heap                                    628        1`24c81000 (   4.575 Gb)  78.18%    0.06%
<unknown>                              1217        0`39381000 ( 915.504 Mb)  15.28%    0.01%
Image                                  2120        0`14561000 ( 325.379 Mb)   5.43%    0.00%
Stack                                   200        0`04080000 (  64.500 Mb)   1.08%    0.00%
Other                                    13        0`001c6000 (   1.773 Mb)   0.03%    0.00%
TEB                                      66        0`00084000 ( 528.000 kb)   0.01%    0.00%
PEB                                       1        0`00001000 (   4.000 kb)   0.00%    0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE                            1509        1`59760000 (   5.398 Gb)  92.24%    0.07%
MEM_IMAGE                              2683        0`184d0000 ( 388.813 Mb)   6.49%    0.00%
MEM_MAPPED                               53        0`04bfe000 (  75.992 Mb)   1.27%    0.00%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE                                412      7fe`897c2000 (   7.994 Tb)           99.93%
MEM_COMMIT                             3267        1`4d29f000 (   5.206 Gb)  88.96%    0.06%
MEM_RESERVE                             978        0`2958f000 ( 661.559 Mb)  11.04%    0.01%

//再查看非托管堆,能看到有1个堆占了4.5G
0:000> !heap -s
LFH Key                   : 0x000000ff3fbc76ee
Termination on corruption : ENABLED
          Heap     Flags   Reserv  Commit  Virt   Free  List   UCR  Virt  Lock  Fast 
                            (k)     (k)    (k)     (k) length      blocks cont. heap 
-------------------------------------------------------------------------------------
Virtual block: 0000000002ca0000 - 0000000002ca0000 (size 0000000000000000)
...
000000001fc30000 00001002 4597824 4568704 4597824    568   296   288    0      1   LFH
...
    External fragmentation  23 % (59 free blocks)
-------------------------------------------------------------------------------------

//查看有问题的堆,大小为27d1的某个内存块共有6cc72个,27d1*6cc72=10eb24712(4.5G)
0:000> !heap -stat -h 000000001fc30000 
 heap @ 000000001fc30000
group-by: TOTSIZE max-display: 20
    size     #blocks     total     ( %) (percent of total busy bytes)
    27d1 6cc72 - 10eb24712  (99.95)
    128 18d5 - 1cb648  (0.04)
    2740 11 - 29b40  (0.00)
    3e80 1 - 3e80  (0.00)

//用size=27d1过滤,查看每个对象的指针UserPtr
0:000> !heap -flt s 27d1
    _HEAP @ 90000
	...
    _HEAP @ 1fc30000
              HEAP_ENTRY Size Prev Flags            UserPtr UserSize - state
        0000000025217420 0281 0000  [00]   0000000025217430    027d1 - (busy)
        0000000025219c30 0281 0281  [00]   0000000025219c40    027d1 - (busy)

//随便找个对象地址,看被哪里引用了,悲剧的是,不管找哪个,都没有堆栈信息stack trace!
0:000> !heap -p -a 0000000025217430    
    address 0000000025217430 found in
    _HEAP @ 1fc30000
              HEAP_ENTRY Size Prev Flags            UserPtr UserSize - state
        0000000025217420 0281 0000  [00]   0000000025217430    027d1 - (busy)

3. 用DebugDiag找原因

上面的路目前看起来是走不大通了,于是换Debug大神-美女程序猿-Tess的这篇博文里的方法,改用DebugDiag试试。

  1. 下载DebugDiag1.2,这是目前最新的版本,已经能支持win7和x64位的系统了。根据你的os选择x86/x64的msi版本并安装。
  2. 重新启动你的程序,因为这个工具是在跟踪tracking内存分配的过程,所以需要在high memory出现之前就开始监控。
  3. 打开DebugDiag,选择Add Rule、Native (non-.NET) Memory and Handle,选择要分析的进程,Rule->Status会显示目前处于tracking的状态。
  4. 执行一段时间、1h 或 1天,转到tab页Processes里右键进程、Create Full Userdump。
  5. 然后转到Advanced Analysis里选中"Memory Pressure Analyzers"脚本,点击Add Data Files把生成的dump加进去。
  6. 点击"Start Analysis"就开始分析了,要耐心一点,视dump大小一般需要>10min。

因为要重新启动,下面的dump是我重新抓的,跑了一晚上,大概有1.7G,分析的结果显示是一个DebussyReverie.dll的模块在不停的分配内存。

DebussyReverie.dll is responsible for 1.15 GBytes worth of outstanding allocations. The following are the top 2 memory consuming functions:	

DebussyReverie!GetLspGuid+79e: 1.07 GBytes worth of outstanding allocations.	
DebussyReverie!GetLspGuid+f31df: 83.04 MBytes worth of outstanding allocations.

Image Name: C:Program FilesLeagSoftUniAccess AgentDebussyReverie.dll 
Function   DebussyReverie!GetLspGuid+79e 
Allocation type   Heap allocation(s) 
Heap handle   0x00000000`00000000 
Allocation Count   114716 allocation(s) 
Allocation Size   1.07 GBytes 
Leak Probability   81% 

//堆栈如下,可惜没能显示从我的代码到问题代码之间的调用关系
Function   Source   Destination 
LeakTrack+13277       
DebussyReverie!GetLspGuid+79e       
DebussyReverie!GetLspGuid+647       
DebussyReverie!GetLspGuid+ea3       
DebussyReverie!WSPStartup+431       
ws2_32!WahOpenApcHelper+20a       
ws2_32!_WSAFDIsSet+e02       
ws2_32!WSCInstallProviderAndChains64_32+1971       
ws2_32!WSCInstallProviderAndChains64_32+1aa0       
ws2_32!WSASendTo+21fc       
ws2_32!getsockname+4a       
DebussyReverie!PublicService+74b       
DebussyReverie!GetLspGuid+a41dd       
DebussyReverie!PublicService+2c70       
ws2_32!WSApSetPostRoutine+298       
ws2_32!WSAConnect+af       
System_ni+32f3f2       
...     
0x7FE8925FEA7 

看起来,问题来自公司安装的网络监控软件LeagSoft里的DebussyReverie.dll。大约是下班后服务器断开了,项目里Apache.NMS.ActiveMQ连接出错时调用多线程反复重连,然后被监控软件拦截到后出的错。应该是LeagSoft里的一个Bug,如果能和他们交流一下,应该能进一步确定问题所在,甚至修复这个问题。