垃圾收集器如何找到有关从堆栈完成的对象引用的信息?

问题描述:

In languages with automatic garbage collection like Haskell or Go, how can the garbage collector find out which values stored on the stack are pointers to memory and which are just numbers? If the garbage collector just scans the stack and assumes all addresses to be references to objects, a lot of objects might get incorrectly marked as reachable.

Obviously, one could add a value to the top of each stack frame that described how many of the next values are pointers, but wouldn't that cost a lot of performance?

How is it done in reality?

在具有自动垃圾收集的语言(例如Haskell或Go)中,垃圾收集器如何才能找出存储在堆栈中的值 是指向内存的指针,而仅仅是数字? 如果垃圾收集器仅扫描堆栈并假定所有地址都是对对象的引用,则很多对象可能会被错误地标记为可访问。 p>

很显然,可以在顶部添加一个值 每个堆栈帧中描述了多少个下一个值是指针,但是这不会花费很多性能吗? p>

在现实中如何实现? p> DIV>

There exist GCs that assume that every bit pattern that is the address of something the GC is managing is in fact a pointer (and so don't release the something). This can actually work pretty well, because calls pointers are usually bigger than small common integers, and usually have to be aligned. But yes, this can cause collection of some objects to be delayed. The Boehm collector for C works this way, because it's library-based and so don't get any specific help from the compiler.

There are also GCs that are more tightly coupled to the language they're used in, and actually know the structure of the objects in memory. I've never read up specifically in stack frame handling, but you could record information to help the GC if the compiler and GC are designed to work together. One trick would be putting all the pointer references together and using one word per stack frame to record how many there are, which is not such a huge overhead. If you can work out what function corresponds to each stack frame without adding a word saying so, then you could have a per-function "stack frame layout map" compiled in. Another option would be to use tagged words, where you set the low order bit of words that are not pointers to 1, which (due to address alignment) is never needed for pointers, so you can tell them apart. That means you have to shift unboxed values in order to use them though.

The Haskell stack uses a single word of memory in each stack frame describing (with a bitmap) which of the values in that stack frame are pointers and which are not. For details, see the "Layout of the stack" article and the "Bitmap layout" article from the GHC Commentary.

To be fair, a single word of memory really isn't much cost, all things considered. You can think of it as just adding a single variable to each method; that's not all that bad.

Some collectors assume everything on the stack is a potential pointer (like Boehm GC). This turns out to be not as bad as one might expect, but is clearly suboptimal. More often in managed languages, some extra tagging information is left with the stack to help the collector figure out where the pointers are.

Remember that in most compiled languages, the layout of a stack frame is the same every time you enter a function, therefore it is not that hard to ensure that you tag your data in the right way.

The "bitmap" approach is one way of doing this. Each bit of the bitmap corresponds to one word on the stack. If the bit is a 1 then the location on the stack is a pointer, and if it is a 0 then the location is just a number from the point of view of the collector (or something along those lines). The exceptionally well written GHC runtime and calling conventions use a one word layout for most functions, such that a few bits communicate the size of the stack frame, with the rest serving as the bitmap. Larger stack frames need a multi word structure, but the idea is the same.

The point is that the overhead is low, since the layout information is computed at compile time, and then included in the stack every time a function is called.

An even simpler approach is "pointer first", where all the pointers are located at the beginning of the stack. You only need to include a length prior to the pointers, or a special "end" word after them, to tell which words are pointers given this layout.

Interestingly, trying to get this management information on to the stack produces a host of problem related to interop with C. For example, it is sub optimal to compile high level languages to C, since even though C is portable, it is hard to carry this kind of information. Optimizing compilers designed for C like languages (GCC,LLVM) may restructure the stack frame, producing problems, so the GHC LLVM backend uses its own "stack" rather than the LLVM stack which costs it some optimizations. Similarly, the boundary between C code, and "managed" code needs to be constructed carefully to keep from confusing the GC.

For this reason, when you create a new thread on the JVM you actually create two stacks (one for Java, one for C).

It's important to realize that GHC maintains its own stack and does not use the C stack (other than for FFI calls). There's no portable way to access all of the contents of the C stack (for instance, in a SPARC some of it is hidden away in register windows), so GHC maintains a stack where it has full control. Once you maintain your own stack you can pick any scheme to distinguish pointers from non-pointers on the stack (like a using a bitmap).