Log all memory accesses of any executable/process in Linux
I have been looking for a way to log all memory accesses of a process/execution in Linux. I know questions have been asked on this topic here before, like this one.
But I wanted to know whether there is any non-instrumentation tool that performs this task. I am not looking at QEMU/Valgrind for this purpose since they would be a bit slow, and I want as little overhead as possible.
I looked at perf mem and PEBS events like cpu/mem-loads/pp for this purpose, but I see that they only collect sampled data, and I actually want a trace of all memory accesses without any sampling.
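For reference, the sampled approaches I looked at boil down to commands roughly like these (the binary name is just a placeholder):

```
perf mem record ./my_app                       # samples loads/stores, not every access
perf mem report
perf record -e cpu/mem-loads/pp -- ./my_app    # PEBS-precise event, still sampled
perf report
```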
I wanted to know whether there is any possibility of collecting all memory accesses without the heavy overhead of a tool like QEMU. Is there any possibility of using perf alone, but without sampling, so that I get all the memory access data?
Is there any other tool out there that I am missing? Or any other strategy that gives me all the memory access data?
It is just impossible both to have the fastest possible run of Spec and to have all memory accesses (or cache misses) traced in that run (using in-system tracers). Do one run for timing, and another run (longer, slower), or even a recompiled binary, for memory access tracing.
You may start with a short and simple program (not the ref inputs of recent SpecCPU, or the billions of memory accesses in your big programs) and use the perf Linux tool (perf_events) to find an acceptable ratio of recorded memory requests to all memory requests. There is the perf mem tool, or you may try some PEBS-enabled events of the memory subsystem. PEBS is enabled by adding a :p or :pp suffix to the perf event specifier, as in perf record -e event:pp, where event is one of the PEBS events. Also try pmu-tools' ocperf.py for easier Intel event name encoding and to find PEBS-enabled events.
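A minimal sketch of that in practice (this assumes pmu-tools is installed; the event name below is only an example of a Broadwell PEBS event, so check ocperf.py list first):

```
# ':p' / ':pp' request increasing PEBS precision on a PEBS-capable event.
ocperf.py list | less                                             # browse events by their Intel names
ocperf.py record -e mem_load_uops_retired.l3_miss:pp -- ./my_app  # example PEBS event on Broadwell
ocperf.py report
```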
Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on memory performance tests (a rough way to measure this is sketched after the list below). Check the worst case of memory recording overhead on the left part of the Arithmetic Intensity scale of the [Roofline model](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/). Typical tests from that part are STREAM (BLAS1), RandomAccess (GUPS) and memlat, which is almost SpMV; many real tasks are usually not that far left on the scale:
- STREAM test (linear access to memory),
- RandomAccess (GUPS) test
- some memory latency test (memlat of 7z, lat_mem_rd of lmbench).
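A rough way to measure that overhead, assuming a locally built STREAM binary named ./stream (any memory-bound test will do; the periods are arbitrary examples):

```
# baseline: no memory-access recording
perf stat -e task-clock -- ./stream

# recorded runs at different sampling periods (-c = take one sample every N events)
for period in 10000 1000 100; do        # roughly 0.01%, 0.1% and 1% of the loads
    echo "=== period $period ==="
    time perf record -e cpu/mem-loads/pp -c "$period" -o "mem-$period.data" -- ./stream
done
# compare the elapsed times of the recorded runs against the baseline
```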
Do you want to trace every load/store instruction, or do you only want to record the requests that missed all (or some) caches and were sent to the PC's main RAM (or to L3)?
Why do you want no overhead while having all memory accesses recorded? That is simply impossible, since every memory access produces several bytes of trace (the memory address, sometimes also the instruction address) that must be recorded to the same memory. So, having memory tracing enabled (for more than 10% of memory accesses) will clearly limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticed, but its effect (overhead) is smaller.
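To put numbers on that (purely illustrative assumptions: one billion traced accesses, 16 bytes per trace record):

```
# 1e9 accesses * 16 bytes/record, written back into the same RAM the program is using
echo "$((1000000000 * 16 / 1024 / 1024 / 1024)) GiB of raw trace data"   # prints 14 (integer division; ~14.9 GiB)
```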
Your CPU, the E5-2620 v4, is Broadwell-EP (14 nm), so it may also have an early variant of Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog on PT: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb":
PT support in hardware: Broadwell (5th generation Core, Xeon v4). More overhead. No fine grained timing.
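If PT turns out to be available, basic usage with perf looks roughly like this (note that PT records control flow, not data addresses, so it complements rather than replaces memory-access tracing; the binary name is a placeholder):

```
perf list | grep intel_pt                  # check whether the PMU exposes intel_pt at all
perf record -e intel_pt//u -- ./my_app     # trace user-space control flow with Intel PT
perf script --itrace=i100us | head         # decode; synthesize instruction samples every 100us
```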
PS: Scholars who studied SpecCPU memory accesses worked with memory access dumps/traces, and the dumps were generated slowly:
- http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses recorded for offline analysis; no timing was recorded from the tracing runs
- http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all loads/stores instrumented by writing into an additional huge tracing buffer, with periodic (rare) online aggregation. Such instrumentation is 2x slower or worse, especially for memory bandwidth / latency limited cores.
- http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation: the program code was modified and instrumented to write memory access metadata into a buffer. Such instrumentation is 2x slower or worse, especially for memory bandwidth / latency limited cores. The paper lists and explains the instrumentation overhead and caveats:
Instrumentation Overhead: Instrumentation involves injecting extra code dynamically or statically into the target application. The additional code causes an application to spend extra time in executing the original application ... Additionally, for multi-threaded applications, instrumentation can modify the ordering of instructions executed between different threads of the application. As a result, IDS with multi-threaded applications comes at the lack of some fidelity
Lack of Speculation: Instrumentation only observes instructions executed on the correct path of execution. As a result, IDS may not be able to support wrong-path ...
User-level Traffic Only: Current binary instrumentation tools only support user-level instrumentation. Thus, applications that are kernel intensive are unsuitable for user-level IDS.