Sunday, 26 January 2014

On the performance of boxed tuples

Stephen Dolan, a PhD student at the University of Cambridge, recently published an interesting article, A “sane” design for multicore OCaml, and the ensuing discussion on Reddit proved to be quite informative. In particular, more than one person asserted that ints, floats and tuples are fast in OCaml. In this blog post I’m going to take a look at tuple performance.

 

As Stephen points out, one might reasonably expect unboxed tuples to be faster for passing arguments to functions and returning multiple values from functions because the elements can stay in registers but slower for storing in the heap because they require multi-word reads and writes instead of a single word (a pointer to the existing tuple). However, HLVM has shown that unboxed tuples can be extremely fast so why the discrepancy?

 

The performance characteristics of different heap topologies are not quite so simple in a garbage collected environment. Two aspects of garbage collection affect the results: the write barrier and survivors. The write barrier is a relatively-slow piece of code injected whenever a program writes a reference into the heap in order to keep the garbage collector apprised of the constantly-changing heap topology. Therefore, writing an unboxed pair of ints into the heap requires two int writes whereas writing a boxed pair of ints into the heap requires one pointer write and a write barrier. Now, in order to record information for the garbage collector the write barrier always performs at least one write in addition to other housekeeping work. Therefore, writing a pair of ints will always be slower if the pair is boxed. In fact, the F# Journal article Pathological garbage collector behaviour found that a write that incurs the write barrier is 2.4x slower than a write that does not. Moreover, .NET is heavily optimized for mutable code so it has a very efficient write barrier whereas OCaml is heavily optimized for purely functional code and has a notoriously slow write barrier.
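
To make the counting argument concrete, here is a minimal sketch in C++ of the two kinds of write. It assumes a toy card-marking barrier, which is only one possible design; the names Pair, BoxedArray, write_unboxed, write_boxed and the card size are all made up for illustration, and the real barriers in OCaml and .NET differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

constexpr std::size_t kSlotsPerCard = 64;  // arbitrary toy value (assumption)

// Unboxed: the pair lives inline in the array, so a write is two int stores.
inline void write_unboxed(std::vector<Pair>& a, std::size_t i, int x, int y) {
    assert(i < a.size());
    a[i].first = x;
    a[i].second = y;
}

// Boxed: the array holds pointers, plus a toy "card table" for the collector.
struct BoxedArray {
    std::vector<const Pair*> slots;
    std::vector<std::uint8_t> dirty;  // one byte per card of kSlotsPerCard slots
    explicit BoxedArray(std::size_t n)
        : slots(n, nullptr),
          dirty((n + kSlotsPerCard - 1) / kSlotsPerCard, 0) {}
};

// One pointer store, then the barrier: an index computation and a second
// store that tells the (imaginary) collector which card changed.
inline void write_boxed(BoxedArray& a, std::size_t i, const Pair* p) {
    assert(i < a.slots.size());
    a.slots[i] = p;
    a.dirty[i / kSlotsPerCard] = 1;
}
```

Even in this stripped-down form the boxed write performs as many stores as the unboxed one, plus the extra arithmetic, and that is before counting the allocation of the box itself.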

 

The next issue that complicates the performance of boxed vs unboxed tuples in the heap is survivors. Both OCaml and .NET use generational garbage collectors. New objects like boxed tuples are allocated in a nursery generation. When the nursery is full, surviving objects are identified and physically copied to the next generation. If a program violates the generational hypothesis (that most objects die young) by allocating many objects that survive then it incurs a performance overhead for marking the survivors, copying them into the next generation and fixing up all pointers to those objects to point at their new locations. If tuples are unboxed then none of these overheads exist.

 

So it is instructive to measure the performance of writing newly-allocated tuples into progressively longer slices of an array. We have done this using boxed tuples in OCaml and F# as well as unboxed structs in F#. The following graph visualizes the performance as a function of the size of the array slice:

For array slices containing up to 1,000 elements none of the tuples survive the first generation and the performance of the boxed tuples is only slightly worse than for the unboxed representation (probably due to the write barrier). For more than 1,000 elements the performance of boxed tuples in both F# and OCaml rapidly worsens until they are 10x slower than unboxed tuples for 1,000,000 elements.

 

So tuples can indeed be fast in OCaml but only if they are short-lived temporaries. If tuples survive the nursery generation (which is 256kB by default) then performance is very bad.

 

The poor performance of long-lived tuples in OCaml has actually been worked around on several occasions. The Map implementation is almost identical to the Set implementation except the key-value pairs have been manually unboxed into the variant type representing the AVL tree, resulting in a large amount of unnecessary code duplication. The Hashtbl implementation uses a custom bucket type that is a list where the key-value pairs in each cons cell have been manually unboxed. Variant types are most elegantly represented as a tag and argument. For multiple arguments, the argument can just be a tuple. This simple and efficient representation is used in HLVM and it works very well. In OCaml, the compiler unboxes multiple arguments as a special case in order to combat the performance of boxed tuples. This results in a language wart where brackets around the arguments to a variant type constructor alter the meaning from multiple arguments to a single argument that is a tuple.

 

Locality is another important aspect of the performance of boxed vs unboxed tuples. Consider sorting an array of pairs and then enumerating the sorted array. With an unboxed representation the elements of the array are physically moved into place within a contiguous block of memory and enumeration is cache friendly. With a boxed representation, sorting scrambles the pointers and enumeration then has worst-case cache behaviour.
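
The two layouts can be sketched in C++, which lets you choose either representation explicitly. This is only an illustration of the memory layout, not a benchmark; the function names are hypothetical and unique_ptr stands in for a boxed tuple.

```cpp
#include <algorithm>
#include <memory>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Unboxed: sorting moves the pairs themselves inside one contiguous block,
// so the enumeration afterwards walks memory sequentially.
std::vector<int> seconds_sorted_unboxed(std::vector<Pair> v) {
    std::sort(v.begin(), v.end());
    std::vector<int> out;
    for (const Pair& p : v) out.push_back(p.second);
    return out;
}

// Boxed: the array holds pointers to separately allocated pairs. Sorting
// permutes the pointers only, so the enumeration afterwards chases pointers
// in an order unrelated to where the pairs sit in memory.
std::vector<int> seconds_sorted_boxed(const std::vector<Pair>& src) {
    std::vector<std::unique_ptr<Pair>> v;
    v.reserve(src.size());
    for (const Pair& p : src) v.push_back(std::unique_ptr<Pair>(new Pair(p)));
    std::sort(v.begin(), v.end(),
              [](const std::unique_ptr<Pair>& a, const std::unique_ptr<Pair>& b) {
                  return *a < *b;
              });
    std::vector<int> out;
    for (const auto& p : v) out.push_back(p->second);
    return out;
}
```

Both functions produce the same answer; the difference lies entirely in the memory access pattern of the final loop.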

 

Suffice to say, the performance of boxed tuples is not as clear-cut as one might imagine.

 

Sunday, 13 October 2013

Memory management myths: promptness

People often assert that scope-based reference counting such as shared_ptr in C++ collects garbage “promptly” and some people define this as “collects at the earliest possible point”. For example, at the time of writing the Wikipedia page about garbage collection says:

“Compared to tracing garbage collection, reference counting guarantees that objects are destroyed as soon as they become unreachable” – Wikipedia

Similar claims can even be seen in published research such as the paper “Down for the Count? Getting Reference Counting Back in the Ring”:

“Of the two fundamental algorithms on which the garbage collection literature is built, reference counting has lived in the shadow of tracing. It has a niche among language developers for whom either performance or completeness is not essential, and is unused by mature high performance systems, despite a number of intrinsic advantages such as promptness of recovery and dependence on local rather than global state.” Blackburn et al.

On the other hand, experts like Richard Jones, co-author of the excellent Garbage Collection Handbook, make statements like:

“More importantly, note also that even an immediate (i.e. non deferred) reference counter cannot reclaim objects as soon as they are no longer referenced as finalisation must be asynchronous (see Hans Boehm's POPL03 paper "Destructors, finalizers and synchronization").” – a post on the gc-list by Richard Jones.

Let’s have a closer look at the thinking behind this belief and test it with a simple program. The mental model that underpins this belief is that any function’s local variables are stored in separate slots in the function’s stack frame for the entire duration of a function’s body and, therefore, will be reachable from the point of view of the garbage collector for the duration of the call to the function. This mental model underpins exam and interview questions such as Is object eligible for garbage collection after “obj = null”? and When Is The Object Eligible For Garbage Collection?.

In reality, this mental model is simple, obvious and wrong. Why? Firstly, the garbage collector sees the run-time representation of a program after it has been subjected to transforms such as inlining, instruction reordering and code block reordering by the compiler that can mutilate the structure of a program beyond recognition and, consequently, concepts like scope that exist only in the source code and not in the compiled form are not visible to the garbage collector. Secondly, the register allocator does everything possible to keep references in registers and avoid spilling them to the stack and when they must be spilled it uses the results of liveness analysis to overwrite any dead references in the stack frame whenever possible. In fact, some compilers don’t even use stack frames, such as our own x86 JIT in F# and the HLVM project, and other compilers like SML/NJ convert every call into continuation style and put stack frames on the heap, splitting every segment of code between a pair of function calls in the source into its own separate function in the compiled form.

Enough theory, let’s take a look at some working code. Here is a simple example using tracing garbage collection in OCaml/F# where an argument tmp to a function dies in the middle of the function body and, in particular, before a recursive call:

let rec loop tmp i =
  if i<=0 then tmp else
    let tmp2 = loop (Array.copy tmp) (i-1)
    tmp2.[0] <- tmp2.[0] + 1
    tmp2

When run using loop (Array.init m id) n, this code clearly uses less than mn space and keeps on running indefinitely. This can only be because the argument tmp is no longer reachable via the stack when the recursive call is made and, consequently, gets garbage collected.

Here is the equivalent using scope-based reference counting in C++:

shared_ptr<vector<double> > loop(shared_ptr<vector<double> > tmp, int i) {
  if (i<=0) {
    return tmp;
  } else {
    shared_ptr<vector<double> > tmp1(new vector<double>(*tmp));
    shared_ptr<vector<double> > tmp2 = loop(tmp1, i-1);
    ++(*tmp2)[0];
    return tmp2;
  }
}

In contrast, this code clearly requires at least mn space when run, goes to swap and (on Windows) dies from out of memory. Unlike the OCaml/F# code, the scope-based reference counting using shared_ptr in C++ keeps the tmp array allocated for longer than necessary, right until the end of the function call.
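
If you would rather count than wait for swap, here is a hedged sketch that instruments the same recursion. The Buffer wrapper, the live and peak counters and the peak_live_buffers helper are my own additions for illustration; the loop function has the same shape as the one above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

static std::size_t live = 0;  // Buffers currently allocated
static std::size_t peak = 0;  // high-water mark of `live`

// A vector<double> wrapper that counts how many instances are alive.
struct Buffer {
    std::vector<double> data;
    explicit Buffer(std::size_t m) : data(m) { note(); }
    Buffer(const Buffer& other) : data(other.data) { note(); }
    ~Buffer() { --live; }
    static void note() { ++live; peak = std::max(peak, live); }
};

// Same structure as the shared_ptr version of loop in the post.
std::shared_ptr<Buffer> loop(std::shared_ptr<Buffer> tmp, int i) {
    if (i <= 0) return tmp;
    std::shared_ptr<Buffer> tmp1(new Buffer(*tmp));
    std::shared_ptr<Buffer> tmp2 = loop(tmp1, i - 1);
    ++tmp2->data[0];
    return tmp2;
}

// Runs the recursion n levels deep on buffers of m doubles and reports the
// largest number of buffers that were alive at the same time.
std::size_t peak_live_buffers(std::size_t m, int n) {
    assert(m >= 1);
    live = peak = 0;
    std::shared_ptr<Buffer> result = loop(std::make_shared<Buffer>(m), n);
    return peak;
}
```

Every level of the recursion keeps its tmp alive until it returns, so the high-water mark is n+1 buffers of m doubles each.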

This observation also destroys another popular memory management myth: that tracing garbage collection always requires more memory than reference counting.

If there is any advantage to the C++ approach then it is the presence of guarantees. The semantics of C++ guarantee that after the end of scope the object has been deleted. However, it is worth noting that this guarantee of determinism does not apply to objects shared between threads because in that situation the threads race to decrement the reference counter to zero and the winner of the race condition is burdened with executing the destructor.

Saturday, 12 October 2013

Memory management myths: determinism

Although the vast majority of programmers have now migrated to garbage collected languages and will probably never go back, there are still a few clinging to manual memory management. In most cases, the continued use of manual memory management is for good reason but some of these people are perpetuating myths in an attempt to justify avoiding garbage collection. Determinism can be a genuinely good reason to stick with manual memory management and is practically important in memory-constrained embedded devices. However, C++ programs are not as deterministic as people sometimes claim and, in particular, thread-safe reference counting using shared_ptr is non-deterministic. Specifically, threads holding references to shared reference-counted objects race to perform the final decrement and the thread that wins the race is responsible for destruction.
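
The race is easy to observe. The following is a minimal sketch, not a rigorous experiment: Widget, RaceResult and race_final_decrement are hypothetical names, and the spin on an atomic flag is just a crude way of making both workers drop their references at about the same time.

```cpp
#include <atomic>
#include <memory>
#include <thread>

static std::thread::id destroyed_on;         // thread that ran the destructor
static std::atomic<int> destructor_runs{0};  // how many times it ran

struct Widget {
    ~Widget() {
        destroyed_on = std::this_thread::get_id();
        ++destructor_runs;
    }
};

struct RaceResult {
    std::thread::id a, b;       // ids of the two worker threads
    std::thread::id destroyer;  // id of the thread that ran ~Widget
    int runs;                   // number of destructor calls
};

RaceResult race_final_decrement() {
    destructor_runs = 0;
    std::atomic<bool> go{false};
    auto sp = std::make_shared<Widget>();
    // Each worker owns one reference and drops it once `go` is set.
    auto worker = [&go](std::shared_ptr<Widget> local) {
        while (!go.load()) std::this_thread::yield();
        local.reset();
    };
    std::thread t1(worker, sp);
    std::thread t2(worker, sp);
    RaceResult r;
    r.a = t1.get_id();
    r.b = t2.get_id();
    sp.reset();      // main gives up its reference first
    go.store(true);  // now the two workers race to the final decrement
    t1.join();
    t2.join();
    r.destroyer = destroyed_on;
    r.runs = destructor_runs;
    return r;
}
```

The destructor runs exactly once, on one of the two workers, but which one can change from run to run.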

Thursday, 10 October 2013

Herb Sutter's favorite C++ 10-liner has a memory management bug

In a recently-posted video, Herb Sutter (a prominent C++ expert) describes his favorite C++ 10-liner as “a thread-safe reference-counted object cache”:

shared_ptr<widget> get_widget(int id) {
  static map<int, weak_ptr<widget>> cache;
  static mutex m;

  lock_guard<mutex> hold(m);
  auto sp = cache[id].lock();
  if (!sp) cache[id] = sp = load_widget(id);
  return sp;
}

This example is very interesting. Firstly, it manages to pull in reference counting, weak references and a mutex which are all very rare in modern programming. Secondly, it contains a memory leak that is difficult to fix in C++ because APIs are burdened with memory management details and this API is incapable of expressing deterministic cleanup because there is no facility for a widget's destructor to remove its entry in the map. Finally, the correct name for this data structure is a concurrent weak dictionary, specifically one with weak values. You'll find correct implementations of this data structure are widely available for C#, F# and Java such as the one here.

The obvious fix is to sweep stale entries from the map when get_widget is called but this leaves floating garbage in the map between calls to get_widget, is asymptotically less efficient and incurs unbounded pauses for an unbounded number of threads.
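
For concreteness, here is roughly what that sweep might look like. This is a sketch under assumptions: the widget struct, the body of load_widget and the cache_size helper are hypothetical stand-ins, and the cache and mutex have been moved to file scope so they can be inspected.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

struct widget { int id; };  // hypothetical stand-in for the real widget type

// Hypothetical loader; the original example does not show load_widget.
std::shared_ptr<widget> load_widget(int id) {
    return std::make_shared<widget>(widget{id});
}

static std::map<int, std::weak_ptr<widget>> cache;
static std::mutex m;

std::shared_ptr<widget> get_widget(int id) {
    std::lock_guard<std::mutex> hold(m);
    // Sweep: erase every entry whose widget has already been destroyed.
    // This walks the whole map while holding the lock on every call.
    for (auto it = cache.begin(); it != cache.end();) {
        if (it->second.expired()) it = cache.erase(it);
        else ++it;
    }
    auto sp = cache[id].lock();
    if (!sp) cache[id] = sp = load_widget(id);
    return sp;
}

// Helper for inspecting the cache (not part of the original API).
std::size_t cache_size() {
    std::lock_guard<std::mutex> hold(m);
    return cache.size();
}
```

After the last reference to a widget is dropped its entry stays in the map until somebody happens to call get_widget again, and every caller pays for the full walk while all other callers wait on the mutex.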

Update: Matthew Avery (from the USA) suggests altering the API and semantics of the functions involved so load_widget returns a shared_ptr with a custom deleter that removes the stale map entry as soon as a widget is destructed. If this idea can be made to work then it would be the only deterministic solution to have been proposed to date.
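
A rough sketch of how that suggestion could look follows. It is not a vetted implementation: the widget struct, the body of load_widget, the file-scope cache_mutex and the cache_size helper are all assumptions of mine, and it ignores questions such as widgets that outlive the cache at program exit.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

struct widget { int id; };  // hypothetical stand-in for the real widget type

static std::map<int, std::weak_ptr<widget>> cache;
static std::mutex cache_mutex;

// Hypothetical loader: returns a shared_ptr whose deleter also tidies the map.
std::shared_ptr<widget> load_widget(int id) {
    return std::shared_ptr<widget>(new widget{id}, [](widget* w) {
        const int key = w->id;
        delete w;
        std::lock_guard<std::mutex> hold(cache_mutex);
        auto it = cache.find(key);
        // Only erase if the entry is still the dead one: another thread may
        // have reloaded the widget while this deleter waited for the lock.
        if (it != cache.end() && it->second.expired()) cache.erase(it);
    });
}

std::shared_ptr<widget> get_widget(int id) {
    std::lock_guard<std::mutex> hold(cache_mutex);
    auto sp = cache[id].lock();
    if (!sp) cache[id] = sp = load_widget(id);
    return sp;
}

// Helper for inspecting the cache (not part of the original API).
std::size_t cache_size() {
    std::lock_guard<std::mutex> hold(cache_mutex);
    return cache.size();
}
```

The expired check matters: without it a deleter that lost the race for the mutex could erase the entry for a freshly loaded widget with the same id.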

Wednesday, 25 September 2013

How do reference counting and tracing garbage collection compare?

Reference counting works by counting the number of references to each object. When the reference count drops to zero the object is definitely unreachable and can be recycled. The first advantage is simplicity. The second advantage is that decrements can occur as locals fall out of scope so collection can be deterministic and predictable. The third advantage is that the asymptotic complexity of reference counting does not depend upon the total size of the heap. The first problem is that cycles keep reference counts above zero even when they are unreachable, so reference counting alone cannot handle the general case. The second problem with reference counting is that incrementing and decrementing a counter in an object every time a reference to it is copied or dropped is very computationally expensive because it incurs a cache miss even if nothing else in the object is read or written. The third problem is that multithreaded reference counting is non-deterministic because increments and decrements race. The fourth problem is that simple implementations like Boost's shared_ptr in C++ allow destructors to avalanche which incurs unbounded pause times.
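
The first problem takes only a few lines to reproduce with shared_ptr. In this minimal sketch the Node type, the destroyed counter and the CycleReport struct are made up for illustration; the cycle is broken by hand at the end purely so that the example cleans up after itself.

```cpp
#include <memory>

static int destroyed = 0;  // number of Node destructors that have run

struct Node {
    std::shared_ptr<Node> other;
    ~Node() { ++destroyed; }
};

struct CycleReport {
    int while_unreachable;  // destructors run after the program lost all access
    int after_manual_break; // destructors run after breaking the cycle by hand
};

CycleReport drop_cycle() {
    destroyed = 0;
    std::weak_ptr<Node> spy;  // lets us find the leaked node again
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->other = b;
        b->other = a;  // cycle: each node keeps the other's count at one
        spy = a;
    }  // a and b go out of scope, but neither count reaches zero
    CycleReport r;
    r.while_unreachable = destroyed;
    if (auto s = spy.lock()) s->other.reset();  // break the cycle manually
    r.after_manual_break = destroyed;
    return r;
}
```

No destructor runs while the cycle is intact, even though nothing but the weak reference can reach either node.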

The first problem is solved in some languages (e.g. Erlang, Mathematica) by imposing unidirectional heaps so cycles cannot be created by design and, therefore, reference counting is an accurate form of GC for them. The second problem can be addressed using a variety of techniques such as deferred decrements but they all add significant complexity and, therefore, undermine the first advantage. The third problem obviously negates the second advantage in the context of multithreaded programs and, in this era of multicore computing, most programs are multithreaded. The problem of avalanching destructors can be solved by adding a queue to defer destruction but this adds complexity and makes the solution unpredictable, i.e. negates both advantages.
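
As an illustration of that last trade-off, here is a hedged sketch of deferred destruction using a custom deleter. The Node type, the pending queue and the make_node and drain functions are hypothetical names, and the sketch is single-threaded.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>

struct Node;
static std::deque<Node*> pending;  // count hit zero, destruction deferred
static std::size_t destroyed = 0;  // number of Node destructors that have run

struct Node {
    std::shared_ptr<Node> next;
    ~Node() { ++destroyed; }
};

// Custom deleter: queue the object instead of destroying it immediately.
inline void defer(Node* n) { pending.push_back(n); }

std::shared_ptr<Node> make_node(std::shared_ptr<Node> next) {
    std::shared_ptr<Node> n(new Node, defer);
    n->next = std::move(next);
    return n;
}

// Destroy at most `budget` queued objects and report how many were destroyed.
// Deleting a node releases its `next`, which may enqueue the following node.
std::size_t drain(std::size_t budget) {
    std::size_t done = 0;
    while (done < budget && !pending.empty()) {
        Node* n = pending.front();
        pending.pop_front();
        delete n;
        ++done;
    }
    return done;
}
```

Dropping the head of a long list now costs one queue push instead of an avalanche, but the moment at which any particular node is destroyed depends on when and how often drain is called.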

Tracing garbage collection works by keeping track of references into the heap (the global roots) and then tracing through the heap to find unreachable objects eligible for collection. The first advantage of tracing collection is that it can be very simple (e.g. the mark-sweep collector in my HLVM project is ~20 lines of code). The second advantage is that it is always accurate (can collect cycles). The third advantage is that naive mark-sweep is much faster than naive reference counting and optimized tracing collectors (e.g. JVM and CLR) are significantly faster than the fastest reference counting collectors. The first disadvantage is that, although they can be deterministic (as OCaml is), they are unpredictable because it is not possible to predict when an object will be collected. The second disadvantage of naive tracing collectors is that mutator threads must all be paused for an unbounded amount of time during each collection cycle. The third disadvantage of tracing collection is that the time spent marking is proportional to the size of the entire heap.
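
To give a feel for how little code naive mark-sweep needs, here is a toy version in C++. It is not the HLVM collector: the Object and Heap types are invented for this sketch, objects are identified by their index in a vector rather than by address, and "freeing" just flips a flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Object {
    std::vector<std::size_t> refs;  // indices of the objects this one points to
    bool marked = false;
    bool live = true;               // set to false once swept
};

struct Heap {
    std::vector<Object> objects;
    std::vector<std::size_t> roots;  // the global roots

    std::size_t alloc() {
        objects.push_back(Object{});
        return objects.size() - 1;
    }

    // Mark everything reachable from the roots using an explicit stack.
    void mark() {
        std::vector<std::size_t> stack(roots);
        while (!stack.empty()) {
            std::size_t i = stack.back();
            stack.pop_back();
            assert(i < objects.size());
            if (objects[i].marked) continue;
            objects[i].marked = true;
            for (std::size_t r : objects[i].refs) stack.push_back(r);
        }
    }

    // Free every live object that was not marked and clear marks for next time.
    std::size_t sweep() {
        std::size_t freed = 0;
        for (Object& o : objects) {
            if (o.live && !o.marked) {
                o.live = false;
                o.refs.clear();
                ++freed;
            }
            o.marked = false;
        }
        return freed;
    }

    std::size_t collect() {
        mark();
        return sweep();
    }
};
```

An unreachable cycle needs no special treatment here: neither of its objects is ever marked, so both are swept.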

The second disadvantage is easily reduced by adopting incremental mark-sweep and can be completely eliminated by more advanced techniques. The third disadvantage turns out to be practically irrelevant because the constant pre-factor is so small: even with the largest heaps in use today the mark phase is surprisingly quick.

There are some common myths surrounding this topic as well. Many people believe (incorrectly) that objects are always kept alive until the end of their scope in the source code and, therefore, that reference counting always collects at the earliest opportunity because it collects objects reachable from variables as they fall out of scope. In reality, this mental model is horribly broken: the GC acts at run-time on compiled code, where the concept of scope does not even exist. Not all locally-held references are spilled to the stack and even when a reference is spilled to the stack it is likely to be overwritten by another when it is no longer needed. Register allocators rely upon liveness analysis (not scope) to determine when references need to be kept.

Why is garbage collection an important feature?

  1. Garbage collection eliminates a class of bugs caused by erroneous memory management (forget to free, free too soon, free more than once).
  2. Garbage collection removes the need for APIs to describe contracts about memory management.
  3. Garbage collection facilitates programming styles such as first-class lexical closures in functional programming (see the Funarg problem).

What's new in garbage collection?

Since the 1950s there have been three main families of collectors: semi-space, mark-sweep and mark-compact. Almost all production GCs have been generational mark-sweep even though they exhibit pathological performance when the nursery is full of survivors: the survivors are marked, (physically) copied and all references to them updated, which is ~3x slower than necessary. This matters in practice, for example when filling a hash table with heap-allocated keys and/or values. The Beltway (2002) and Immix (2008) garbage collectors introduced a new family called mark-region GCs. With a mark-region GC the entire heap is a collection of regions and so is the nursery generation, so the nursery can be logically aged by replacing it with another region when it is full of survivors. Sun's Hotspot JVM introduced the first mainstream mark-region GC with its G1 collector.

The advent of multicore in 2005 has meant more emphasis on parallel and concurrent garbage collectors. The Staccato (2008) garbage collector is the first that is simultaneously parallel, concurrent and real-time.

.NET represents a major lateral advancement among mainstream GCs because support for reified generics and value types allows .NET languages like C# and F# to express generic collections that can be filled with new values without requiring any heap allocation. For example, a Dictionary can be filled with ints, floats, complex numbers and low-dimensional vectors and matrices in this way.