Sunday, 13 October 2013

Memory management myths: promptness

People often assert that scope-based reference counting, such as shared_ptr in C++, collects garbage “promptly”, and some people define this as “collects at the earliest possible point”. For example, at the time of writing the Wikipedia page about garbage collection says:

“Compared to tracing garbage collection, reference counting guarantees that objects are destroyed as soon as they become unreachable” – Wikipedia

Similar claims can even be seen in published research such as the paper “Down for the Count? Getting Reference Counting Back in the Ring”:

“Of the two fundamental algorithms on which the garbage collection literature is built, reference counting has lived in the shadow of tracing. It has a niche among language developers for whom either performance or completeness is not essential, and is unused by mature high performance systems, despite a number of intrinsic advantages such as promptness of recovery and dependence on local rather than global state.” Blackburn et al.

On the other hand, experts like Richard Jones, co-author of the excellent Garbage Collection Handbook, make statements like:

“More importantly, note also that even an immediate (i.e. non deferred) reference counter cannot reclaim objects as soon as they are no longer referenced as finalisation must be asynchronous (see Hans Boehm's POPL03 paper "Destructors, finalizers and synchronization").” – a post on the gc-list by Richard Jones.

Let’s have a closer look at the thinking behind this belief and test it with a simple program. The mental model that underpins it is that a function’s local variables are stored in separate slots in its stack frame for the entire duration of the function’s body and, therefore, remain reachable from the garbage collector’s point of view for the duration of the call. This mental model underpins exam and interview questions such as Is object eligible for garbage collection after “obj = null”? and When Is The Object Eligible For Garbage Collection?.

In reality, this mental model is simple, obvious and wrong. Why? Firstly, the garbage collector sees the run-time representation of a program after it has been subjected to transforms such as inlining, instruction reordering and code block reordering by the compiler, transforms that can mutilate the structure of a program beyond recognition. Consequently, concepts like scope, which exist only in the source code and not in the compiled form, are invisible to the garbage collector. Secondly, the register allocator does everything possible to keep references in registers and avoid spilling them to the stack and, when they must be spilled, it uses the results of liveness analysis to overwrite dead references in the stack frame whenever possible. In fact, some compilers do not use stack frames at all, such as our own x86 JIT in F# and the HLVM project, and other compilers like SML/NJ convert every call into continuation-passing style and put stack frames on the heap, splitting every segment of code between a pair of function calls in the source into its own separate function in the compiled form.

Enough theory, let’s take a look at some working code. Here is a simple example using tracing garbage collection in OCaml/F# where an argument tmp to a function dies in the middle of the function body and, in particular, before a recursive call:

let rec loop tmp i =
  if i<=0 then tmp else
    // the copy below is the last use of tmp, so tmp is dead before the recursive call
    let tmp2 = loop (Array.copy tmp) (i-1)
    tmp2.[0] <- tmp2.[0] + 1
    tmp2

When run using loop (Array.init m id) n, this code clearly uses far less than mn space and keeps on running indefinitely. This can only be because the argument tmp is no longer reachable via the stack when the recursive call is made and, consequently, gets garbage collected.

Here is the equivalent using scope-based reference counting in C++:

shared_ptr<vector<double> > loop(shared_ptr<vector<double> > tmp, int i) {
  if (i<=0) {
    return tmp;
  } else {
    // the copy below is the last use of tmp, but the shared_ptr (and hence
    // the array it owns) lives on until the end of this scope
    shared_ptr<vector<double> > tmp1(new vector<double>(*tmp));
    shared_ptr<vector<double> > tmp2 = loop(tmp1, i-1);
    ++(*tmp2)[0];
    return tmp2;
  }
}

In contrast, this code clearly requires at least mn space when run, goes to swap and (on Windows) eventually dies with an out-of-memory error. Unlike the OCaml/F# code, scope-based reference counting using shared_ptr in C++ keeps each tmp array allocated for longer than necessary, right up until the end of the function call.
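
For reference, a minimal driver corresponding to the F# call loop (Array.init m id) n might look like the following when compiled together with the definition of loop above; the concrete values of m and n here are illustrative assumptions, not the ones behind any particular measurement:

#include <iostream>
#include <memory>
#include <vector>

using namespace std;

shared_ptr<vector<double> > loop(shared_ptr<vector<double> > tmp, int i); // defined above

int main() {
  const int m = 1024 * 1024; // doubles per array (illustrative)
  const int n = 1000;        // recursion depth (illustrative)
  shared_ptr<vector<double> > init(new vector<double>(m, 0.0));
  shared_ptr<vector<double> > result = loop(init, n);
  cout << (*result)[0] << endl; // use the result so the work is not optimised away
  return 0;
}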

This observation also destroys another popular memory management myth: that tracing garbage collection always requires more memory than reference counting.

If there is any advantage to the C++ here then it is the presence of guarantees: the semantics of C++ guarantee that the object has been destroyed by the time its scope ends. However, it is worth noting that this guarantee of determinism does not extend to objects shared between threads because, in that situation, the threads race to decrement the reference count to zero and whichever thread happens to win that race is burdened with executing the destructor.
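
A minimal sketch of that situation (the Payload type and the idle worker threads are illustrative assumptions, not code from the post): three owners of one shared_ptr release their references concurrently, and whichever release decrements the count to zero runs the destructor.

#include <cstdio>
#include <memory>
#include <thread>

struct Payload {
  ~Payload() { std::printf("destroyed\n"); }
};

int main() {
  std::shared_ptr<Payload> p(new Payload);
  // Each lambda captures its own copy of p, so each thread owns a reference
  // that is released when the thread finishes running it.
  std::thread a([p] { });
  std::thread b([p] { });
  p.reset(); // the main thread releases its reference too
  // Whichever of the three releases happens last decrements the count to zero
  // and runs ~Payload, so the destructor may execute on any of the threads.
  a.join();
  b.join();
  return 0;
}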

12 comments:

Aaron said...

I get an OOM exception in the F# "O(1) space" code. Did you test it in F# or OCaml?

Jon Harrop said...

I tested it in F#. Make sure you don't have the debugger attached...

Unknown said...

I think your C++ example is wrong. You use an object on the stack (shared_ptr) that points to an object on the heap (std::vector) that points to a sequence of objects on the heap (doubles) to prove that stack objects are destroyed when the stack is unwound. We all know that. There is no need for the shared_ptr here; using just vector is enough.


vector<double> loop(vector<double> tmp, int i)
{
    if (i <= 0)
    {
        return tmp;
    }
    else
    {
        vector<double> tmp2 = loop(tmp, i - 1);
        ++tmp2[0];
        return tmp2;
    }
}

int main()
{
    auto vec = vector<double>(10*1024*1024, 0);
    auto ptr = loop(vec, 2000);

    return 0;
}


However, this also has the problem of making unnecessary copies and running out of memory (since memory is only released when the vectors go out of scope, which happens when the function call in which they were created returns).

So I think the better implementation of your example should use move semantics.

vector<double> loop(vector<double>&& tmp, int i)
{
    if (i <= 0)
    {
        return tmp;
    }
    else
    {
        vector<double> tmp2 = loop(std::move(tmp), i - 1);
        ++tmp2[0];
        return tmp2;
    }
}

int main()
{
    auto vec = vector<double>(10*1024*1024, 0);
    auto ptr = loop(std::move(vec), 3600);

    return 0;
}

This runs 'indefinitely', though in practice a stack overflow happens after 3600+ recursive calls (in my tests).

Marius Bancila said...

What I failed to mention in my previous comment is that translating code line by line from one language to another will not produce the expected results. You should use the appropriate paradigms and practices for each language if you want to achieve performance. You can shoot yourself in the foot with any programming language. Writing C++ code as if it were F#, or F# code as if it were C++, is no good and one should avoid it.

Current said...

Jon is right that reference counting isn't "prompt" in the way that some people claim it is.

On the other hand I don't quite agree with him about his F# and C++ examples. As I understand it, most languages work like this: when a recursive call is made, the context of the call is repeated. The scope that the first call occurs in is not the same as the scope that the second call occurs in. So the "shared_ptr<vector<double> > tmp1" and tmp2 are different each time "loop" is called and each creates a new reference. The C++ example fails because the C++ compiler's optimizer doesn't recognize that this type of recursion can be optimized, but the F# optimizer does.

The memory exhaustion happens because of a combination of two things. Firstly, there are many scopes and many local variables that aren't removed by compiler optimization; these create many references. Secondly, reference counts are only decremented at the end of each of those scopes.

Really, only the second factor is important for Jon's argument; the first is an aspect of compiler optimization. But his examples depend more on the first factor. If the F# compiler used reference counting it wouldn't fail due to memory exhaustion either, because it would optimize the excess variables out of existence. Similarly, if the F# example were written to use mutual recursion (or to otherwise fool the optimizer) then it would fail just as the C++ does.

A better example is a much simpler one.... Imagine the "main" of a large program.

int main(void)
{
    int beetroot[HUGE];
    do_something_with_beetroot(beetroot);
    everything_else();
    return EXIT_SUCCESS;
}

Suppose the function spends most of its time in "everything_else", but the array beetroot is very large. In a GC system beetroot will be deallocated and the memory freed. In static memory systems and most reference counting systems that array won't be deallocated until the program has exited, so it will take up memory throughout the entire execution. (Even if the memory pages it uses are swapped to disk, it will still consume address space and the time needed to swap the data out.) "GCC magic" was introduced to GCC and the Linux kernel to prevent this problem.
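
The usual manual workaround in C++ is to force the release by hand, with an explicit scope (or an explicit reset), before the long-running part. A minimal compilable sketch; the stub definitions and the buffer size are placeholders standing in for the real program in the example above:

#include <cstddef>
#include <cstdlib>
#include <vector>

// Stubs standing in for the large buffer and the rest of a long-running program.
const std::size_t HUGE_N = 64 * 1024 * 1024;
void do_something_with_beetroot(std::vector<double> &) {}
void everything_else() {}

int main()
{
    {
        std::vector<double> beetroot(HUGE_N);
        do_something_with_beetroot(beetroot);
    } // beetroot is destroyed here, before everything_else runs
    everything_else();
    return EXIT_SUCCESS;
}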

Jagan said...
This comment has been removed by the author.
Jagan said...

As Marius Bancila noted, you don't need a shared_ptr for this. Even if you believe the memory management comparison between C++ and F# should be fair, why not use unique_ptr?

This is what I have.

std::unique_ptr<std::vector<double>> loop(std::unique_ptr<std::vector<double>> tmp, int i)
{
    if (i <= 0)
    {
        return tmp;
    }
    else
    {
        auto tmp2 = std::move(loop(std::move(tmp), i - 1));
        ++(*tmp2)[0];
        return tmp2;
    }
}

int main()
{
    TimeTaken([]() -> bool {
        auto tmp = std::make_unique<std::vector<double>>(10 * 1024 * 1024, 0);
        tmp = std::move(loop(std::move(tmp), 4000));
        std::cout << (*tmp)[0] << "\n";
        return true;
    }, "Time taken for recursive loop using unique_ptr");
}

output:
4000
Time taken for recursive loop using unique_ptr.Time taken:46.88 ms.

Jagan said...

If you thought std::move() was a bit of a cheat, here is a version that creates new unique_ptrs in loop() instead of moving tmp into the recursive call.



typedef std::unique_ptr<std::vector<double>> Doubles;

Doubles loop(Doubles&& tmp, int i)
{
    if (i <= 0)
    {
        return std::make_unique<std::vector<double>>(*tmp);
    }
    else
    {
        auto tmp2 = std::move(loop(std::make_unique<std::vector<double>>(*tmp), i - 1));
        ++(*tmp2)[0];
        return tmp2;
    }
}


int main()
{
    TimeTaken([]() -> bool {
        auto tmp = std::make_unique<std::vector<double>>(32 << 20, 0);
        std::cout << "Trying to recursive loop through 4000 times with the data of size:" << tmp->size() << "\n";
        tmp = std::move(loop(std::move(tmp), 4000));
        std::cout << (*tmp)[0] << "\n";
        return true;
    }, "Time taken for recursive loop using make_unique");
}

Jon Harrop said...

@Jagan: "As Marius Bancila, noted you don't need a shared_ptr for this. Even if you believe that memory management comparison in C++ vs F# should be fair then why not use unique_ptr?". This is not a comparison of C++ vs F#. This is a comparison of shared_ptr vs tracing garbage collection. That's why I've used shared_ptr.

Jon Harrop said...

@Unknown: "There is no need for the shared_ptr here, using just vector is enough". The sole purpose is to study the promptness of shared_ptr vs tracing garbage collection.

Jon Harrop said...

@Current: Actually, I deliberately wrote these programs to inhibit tail call optimization (there are no tail calls here) so it cannot be the cause.

"GCC magic was introduced to GCC and the Linux kernel to prevent this problem". Absolutely. I would argue that a better solution would be to move decrements earlier when possible on the basis of liveness analysis. However, that approach still produces some floating garbage.

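As a rough sketch of what moving the decrements earlier would look like if done by hand in the shared_ptr example from the post (the explicit reset and the std::move here are illustrative additions, not the original code):

#include <memory>
#include <utility>
#include <vector>

using namespace std;

shared_ptr<vector<double> > loop(shared_ptr<vector<double> > tmp, int i) {
  if (i <= 0) {
    return tmp;
  } else {
    shared_ptr<vector<double> > tmp1(new vector<double>(*tmp));
    tmp.reset(); // tmp is dead after the copy above, so release it now
    shared_ptr<vector<double> > tmp2 = loop(std::move(tmp1), i - 1); // don't keep a live copy of tmp1 either
    ++(*tmp2)[0];
    return tmp2;
  }
}

With those two changes each array can be freed before the recursive call rather than at the end of the enclosing scope, so the live heap should stay roughly constant, much like the F# version; the point, of course, is that shared_ptr does not do this for you.
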
Current said...

I see what you mean. I still can't wrap my head around how this happens though.

You write "This can only be because the argument tmp is no longer reachable via the stack when the recursive call is made and, consequently, gets garbage collected."

How does that work? I don't know about OCaml's internals, but in other languages "tmp" and "i" will be put on the stack, or more likely a pointer to tmp will be put there. They won't be removed from the stack until the function has exited. So, when the GC probes the stack it will find the pointer to tmp and hang onto it. The compiler must be doing something special here to drop the reference to tmp, such as reusing the pointer to point to something else. I can't see how GC can create this behaviour on its own.