Wednesday, 16 November 2011

What is your garbage collector collecting?

As a garbage collector runs it recycles unreachable values. A well-known result in GC theory is that values tend to die in "clumps" rather than individually. This post visualizes those clumps.

To gather information about the structures being collected, we instrumented a C++ program, equipped with our own mark-region collector (using 64kB regions), that is equivalent to the following F# implementation of a list-based n-queens solver:
let safe x1 y1 x2 y2 =
  x1 <> x2 && y1 <> y2 &&
    x2 - x1 <> y2 - y1 && x1 - y2 <> x2 - y1

let ps n =
  [ for i in 1 .. n do
      for j in 1 .. n do
        yield i, j ]

let rec filter x0 y0 = function
  | [] -> []
  | (x1, y1 as q)::xys as list ->
      let xys' = filter x0 y0 xys
      if safe x0 y0 x1 y1 then
        if System.Object.ReferenceEquals(xys', xys) then list else
          q::xys'
      else
        xys'

let rec search n nqs qs ps a =
  match ps with
  | [] -> if nqs=n then a+1 else a
  | (x, y as q)::ps ->
      search n nqs qs ps a
      |> search n (nqs+1) (q::qs) (filter x y ps)

let solve n =
  search n 0 [] (ps n) 0
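
As a quick sanity check (ours, not part of the original post), the solver can be run in F# Interactive; the classic 8-queens problem is known to have 92 solutions:

> solve 8;;
val it : int = 92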

Stopping this program on its 1,000th GC cycle while it is solving the 11-queens problem, extracting the references between allocated-but-unmarked cons cells, keeping only the connected components with three or more vertices and visualizing the result using Mathematica's GraphPlot function gives the following:

In this case, we find that just 6% of the cons cells die alone (i.e. as singleton lists), 84% die in clumps of two cons cells, 6.5% in clumps of three and 3.4% in clumps of four or more. Only a handful of the clumps had interesting structures.

Although these results suggest that it might be easy to obtain a speedup by adding a special case to the list type for two-element lists, we have found that this is not the case because the required changes add expensive operations elsewhere; for example, destructuring the special two-element list now incurs an allocation. The next best solution would be to optimize the garbage collector for this case, but it would be prudent to analyze the clumps collected during the running of other programs first. Analyzing the garbage generated by a program using red-black trees would be interesting...

Monday, 14 November 2011

The LMAX disruptor and Baker's Treadmill

LMAX, a new retail financial trading platform, recently published the work they have done on a new piece of software technology for low-latency servers that they call the "disruptor".

We have noticed a striking similarity between their disruptor and an old garbage collection algorithm from 1992 called Baker's Treadmill.

The LMAX disruptor and the story behind it are very interesting in their own right and well worth reading up on. In their system, binary messages come from the "receiver" and are distributed in parallel to the "journaller", "replicator" and "unmarshaller"; the "unmarshaller" then passes the deserialized messages on to the "consumer". The core of their idea is to accomplish all of this message passing with a single shared data structure, the disruptor, rather than with several separate concurrent queues.
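
To make that contrast concrete, here is a deliberately simplified, single-threaded sketch of the idea (our illustration, not LMAX's actual Disruptor API; all of the concurrency machinery such as sequence barriers, memory fences and batching is omitted):

// One shared ring buffer with a single write cursor and an independent read
// cursor per consumer, instead of one queue per consumer.
type Ring<'a>(size: int) =
  let buffer = Array.zeroCreate<'a> size
  let mutable writeCursor = 0L
  // The producer publishes into the one shared buffer.
  // (No wrap-around protection in this sketch: old slots are simply overwritten.)
  member this.Publish(x: 'a) =
    buffer.[int (writeCursor % int64 size)] <- x
    writeCursor <- writeCursor + 1L
  // Each consumer keeps its own read cursor and polls for the next message.
  member this.Read(readCursor: int64) =
    if readCursor < writeCursor then
      Some (buffer.[int (readCursor % int64 size)])
    else
      None

The structural point is that there is one ring and many independent cursors; in the real disruptor the cursors are shared sequence numbers that the threads coordinate on without locks.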

Baker's Treadmill is a low-latency garbage collection algorithm. Allocated blocks in the heap are linked together to form a cyclic doubly-linked list that is divided into four segments by four iterators that chase each other around the ring as heap blocks are allocated, traced and freed. In addition to making all of the necessary operations incremental, this algorithm is interesting because it migrates heap blocks between its "to", "from", "new" and "free" spaces without physically copying them as so-called copying GC algorithms do (e.g. Cheney semi-space or the nursery in a generational GC). Although this kind of logical migration is very valuable, it is surprisingly uncommon in the GC research literature. The VCGC algorithm is another example of a GC algorithm that logically migrates heap blocks, in its case between "young", "old" and "dead" spaces, or "epochs" as the authors call them.
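
The Treadmill's key trick, moving a block from one segment of the ring to another purely by re-linking, can be sketched as follows (our paraphrase of the data structure, not Baker's code; the segment boundaries themselves are just distinguished nodes in the same ring):

// One heap block in the cyclic doubly-linked ring.
type Block<'a> =
  { mutable Prev : Block<'a>
    mutable Next : Block<'a>
    mutable Value : 'a }

// Detach a block from wherever it currently sits in the ring.
let unlink (b: Block<'a>) =
  b.Prev.Next <- b.Next
  b.Next.Prev <- b.Prev

// Re-attach it immediately after some other node, e.g. the node marking the
// head of the destination segment. No data is copied; only four pointers change.
let insertAfter (dest: Block<'a>) (b: Block<'a>) =
  b.Prev <- dest
  b.Next <- dest.Next
  dest.Next.Prev <- b
  dest.Next <- b

Logical migration between the "to", "from", "new" and "free" segments is then an unlink followed by an insertAfter at the appropriate segment boundary.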

Beyond the striking resemblance of the disruptor to the Treadmill, there are some important differences:

  • The disruptor is a thread-safe concurrent data structure whereas the Treadmill was designed for single-threaded use.
  • The disruptor was designed to flow data from producers to consumers whereas the Treadmill also requires the ability to move an element from one segment of the ring to another.

Despite these differences, we think it is interesting to observe the similarities between these two different data structures and to note that they were both designed for low latency. Given that Baker's Treadmill is now quite dated in terms of low latency GC algorithms, perhaps future work on low-latency data structures can borrow from more recent GC theory?


Saturday, 12 November 2011

Real garbage collector characteristics

The trade-offs between tracing and reference-counting garbage collectors are nicely demonstrated by systems like F# and Mathematica. F# inherits garbage collection from .NET, which uses a conventional generational tracing garbage collector with three generations plus a Large Object Heap (LOH). Mathematica uses reference counting, with language semantics that make it impossible to create cycles in the heap.

The following Mathematica program creates a balanced binary tree 25 levels deep, containing 2^25 branches, stores it in a variable and then mutates the variable back to the empty list:

Because Mathematica uses reference counting, resetting the variable to the empty list decrements the reference count of the root of our tree to zero, which causes an avalanche of decrements through the branches down to the leaves, reclaiming all of the space that was consumed by the now-unreachable tree. Moreover, Mathematica is serial, so all of this work is done on the main thread, blocking everything else. Consequently, resetting the variable to the empty list takes a whopping 2.213s.
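
The cascade is easy to see in a toy model of reference counting (a sketch of ours, nothing to do with Mathematica's internals): dropping the last reference to the root triggers a recursive wave of decrements all the way down to the leaves, on the spot:

// A reference-counted node: a count plus references to its children.
type RcNode =
  { mutable Count : int
    Children : RcNode list }

// Drop one reference to a node. If the count hits zero the node is dead and
// every child loses a reference in turn, so the whole subtree is reclaimed
// here, synchronously, on the thread that dropped the last reference.
let rec release (node: RcNode) =
  node.Count <- node.Count - 1
  if node.Count = 0 then
    List.iter release node.Children

For a tree with tens of millions of branches, that synchronous wave of decrements is precisely the multi-second pause observed here.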

Surprisingly, although Mathematica has clearly done a lot of work, presumably releasing the space consumed by the tree, it does not actually return any memory to the OS and continues to consume almost half of the entire 32-bit address space on this machine!

This program may be written as follows in F#:

type t = Leaf | Branch of t * t

let rec deepTree = function
  | 0 -> Leaf
  | n -> Branch(deepTree(n-1), deepTree(n-1))

let mutable t = deepTree 25

t <- Leaf

System.GC.Collect()

In this case, creating the tree takes 10.2s, resetting the tree takes a fraction of a millisecond and explicitly invoking the garbage collector (which actually returns all of the memory back to the OS) takes 0.44s.
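
These timings can be reproduced approximately (this harness is just an illustration of ours; the absolute numbers will vary with the machine and .NET version) by wrapping each phase in a stopwatch, assuming the definitions above are in scope:

// Time each phase of the experiment separately.
let sw = System.Diagnostics.Stopwatch.StartNew()
let mutable tree = deepTree 25
printfn "build:   %.3fs" sw.Elapsed.TotalSeconds

sw.Restart()
tree <- Leaf
printfn "reset:   %.3fs" sw.Elapsed.TotalSeconds

sw.Restart()
System.GC.Collect()
printfn "collect: %.3fs" sw.Elapsed.TotalSeconds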

This highlights some of the trade-offs involved both in using systems that provide garbage collection as a user and in creating such systems as a developer.

Friday, 11 November 2011

Classifying garbage collection algorithms

Richard Jones' excellent new book about garbage collection has rekindled my fascination with this important subject. The Wikipedia page about GC is of disappointingly poor quality, so it is worth reviewing the main points here.

GC algorithms can be categorized into four main families, with production garbage collectors often combining algorithms from different families. The families are copying, mark-sweep, mark-compact and reference counting. The first three are all tracing collectors, which work by tracing all reachable values starting from the global roots (global variables and locals held on the stack).

Copying collectors allocate into one space and then "evacuate" the reachable heap blocks (called "survivors") into another space before clearing the space they came from. The Cheney semi-space algorithm is the simplest copying collector: it uses two spaces and copies from one to the other and back again. The advantage of copying collection is that many heap-allocated blocks are deallocated simultaneously simply by resetting a pointer. This is ideal when only a small proportion of the allocated values survive a collection. As most values die young, the copying collection algorithm is commonly used for the nursery generation in a generational collector.
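
The essence of Cheney's algorithm fits in a few lines over a toy heap (a sketch of ours, not any production collector): reachable objects are copied into to-space, and a scan pointer chasing the allocation pointer turns to-space itself into the work queue:

// A toy heap object: just an array of optional references to other objects.
type Obj = { Fields : Obj option [] }

let evacuate (roots: Obj list) =
  // To-space, in allocation order, and a forwarding table from old copies to new.
  let toSpace = ResizeArray<Obj>()
  let forward =
    System.Collections.Generic.Dictionary<Obj, Obj>(HashIdentity.Reference)
  // Copy an object into to-space unless it has already been evacuated.
  let copy (o: Obj) =
    match forward.TryGetValue o with
    | true, copied -> copied
    | _ ->
        let copied = { Fields = Array.copy o.Fields }
        forward.[o] <- copied
        toSpace.Add copied
        copied
  // Evacuate the roots first.
  let roots' = List.map copy roots
  // The scan pointer chases the free pointer (toSpace.Count): everything behind
  // scan already has its fields redirected into to-space.
  let mutable scan = 0
  while scan < toSpace.Count do
    let o = toSpace.[scan]
    for i in 0 .. o.Fields.Length - 1 do
      o.Fields.[i] <- Option.map copy o.Fields.[i]
    scan <- scan + 1
  roots'

Resetting from-space afterwards frees every unreached object at once, which is where the cheap bulk deallocation described above comes from.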

Mark-sweep is the oldest garbage collection algorithm (McCarthy 1959) and works by tracing all reachable values and then deallocating all of the remaining, unreachable values. This algorithm offers relatively high throughput and can be made incremental (low latency) using Dijkstra's tricolor marking scheme but is (arguably) prone to fragmentation. Consequently, the mark-sweep algorithm is commonly used for the old generation of generational collectors.
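
For comparison, a mark-sweep pass over the same toy heap might look like this (again a sketch rather than production code; a real sweep would return the unmarked blocks to a free list rather than merely filtering them out):

// Reuses the toy Obj type from the copying sketch above.
let markSweep (roots: Obj list) (heap: ResizeArray<Obj>) =
  let marked = System.Collections.Generic.HashSet<Obj>(HashIdentity.Reference)
  // Mark: trace every object reachable from the roots.
  let rec mark (o: Obj) =
    if marked.Add o then
      Array.iter (Option.iter mark) o.Fields
  List.iter mark roots
  // Sweep: anything left unmarked in the allocation list is garbage;
  // here we simply return the survivors.
  ResizeArray(Seq.filter (fun o -> marked.Contains o) heap)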

Mark-compact traces the reachable values and then slides them together in the heap. This avoids the fragmentation that can (allegedly) afflict mark-sweep collectors, but the throughput of mark-compact collectors is apparently so poor that they are practically unheard of in production.

Reference counting works by having each value count the number of references to it. The program must register locally held references by incrementing the reference count of the value referred to and decrementing it again when the reference is no longer held. This is typically done by decrementing the reference count when the local reference falls out of scope in the source code, although liveness analysis can be exploited to find a tighter bound on when a reference is no longer needed. When a reference count is decremented to zero there are no longer any references to the value, so it is unreachable and can be deallocated. Interestingly, reference counting is the opposite of tracing: it works by pursuing dropped references rather than by following live references. The advantages of reference counting are its simplicity, its determinism in the context of single-threaded programs (multithreaded programs race to decrement and, therefore, deallocate non-deterministically) and the ease with which it can be made incremental by maintaining a queue of reference counts that have yet to be decremented. The disadvantages are that it leaks cycles and has very poor throughput. However, cycles can be collected either by a backup tracing collector or by a dedicated cycle collector such as the trial-deletion algorithm, and throughput can be improved by deferring decrements (although this reintroduces unpredictability).
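
The deferred-decrement idea can be sketched as follows (our illustration, not any particular runtime's implementation): dropping a reference merely enqueues the decrement and the allocator later applies a bounded batch of them:

// A reference-counted node, as before.
type Counted =
  { mutable Count : int
    Children : Counted list }

// Decrements that have been requested but not yet applied.
let pending = System.Collections.Generic.Queue<Counted>()

// Dropping a reference is now constant time: just enqueue the decrement.
let drop (node: Counted) = pending.Enqueue node

// Apply at most n deferred decrements, e.g. called on each allocation. A node
// whose count reaches zero is dead, so its children are enqueued in turn.
let rec applyDeferred n =
  if n > 0 && pending.Count > 0 then
    let node = pending.Dequeue()
    node.Count <- node.Count - 1
    if node.Count = 0 then
      List.iter drop node.Children
    applyDeferred (n - 1)

This caps the work done at any one time at n decrements, at the cost of the non-determinism mentioned above: dead values are reclaimed some time after their last reference is dropped.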

Latency is a big issue in the context of garbage collection so algorithms are also classified according to their latency characteristics. Stop-the-world algorithms are relatively simple but high latency, incurring arbitrarily-long pause times during which the mutator threads (those running the actual program) make no progress at all. The .NET 4 server GC is an example of a stop-the-world GC in production. GHC also uses a stop-the-world GC. The advantages of stop-the-world garbage collection algorithms are simplicity and high throughput. These GCs often incur pauses of the order of one second, which is totally unacceptable for many soft real-time applications such as visualization, games and (ironically!) servers.

Incremental GCs break up the work of a full GC and interleave the resulting bits of work with the execution of the program. For example, OCaml's GC performs a "slice" of the work required to collect the old generation every time the young generation is collected. The advantages of incremental GCs are low latency compared to stop-the-world collectors (e.g. ~10ms pauses with OCaml) and simplicity compared to concurrent GCs.

Today, the term "parallel GC" describes a GC that uses multiple threads to speed up the collection of garbage; the .NET 4 server GC is an example because it uses one thread per core to mark the heap in parallel. A "concurrent GC" is one that runs at the same time as the mutators; the .NET workstation GC is an example of a concurrent GC in production. Concurrent garbage collectors are further refined into on-the-fly and mostly-concurrent algorithms. An on-the-fly GC is defined by Jones et al. as one that never suspends more than one mutator at a time (i.e. there is no stop-the-world phase), although I suspect the intended meaning is that mutators are suspended independently of each other. Finally, a mostly-concurrent GC does suspend all mutator threads simultaneously but only for a short period of time, not for an arbitrarily-long GC phase. For example, the .NET workstation GC pauses all mutator threads while the global roots are examined but runs concurrently with them while marking and sweeping the whole heap, so it is a mostly-concurrent garbage collector.

These classifications by family of algorithm and by latency characteristics are the most important ones.


Thursday, 3 November 2011

Applying optimization algorithms to profits

As a technology company, we like to apply technical solutions to problems at all levels. Our board of directors even apply technical solutions to the problem of company direction.

Business can be thought of as an optimization algorithm: tweaking products, prices and processes in order to maximize company profits. Interestingly, we use a number of different kinds of optimization algorithm when setting the direction of the company.

We begin new product lines based on experience but continue to optimize our products based on customer feedback, trying to solve the problems that are most important to our customers. For example, our F# for Numerics library started life as our second attempt at selling libraries to F# users (our first attempt was F# for Visualization) and we provided the features we thought would be most useful. Customers inevitably requested more features, including technical ones like parallel matrix inversion with arbitrary-precision rational arithmetic as well as non-technical ones such as licences allowing them to bundle our library in their own commercial products. This approach of incrementally improving products is a kind of local optimization algorithm, like gradient descent: a low-risk approach that yields small improvements in profits.

Over the years, we have been commissioned to write many books and reports and have published many books ourselves. We often use content from one successful book as the inspiration for the next. For example, our 2004 book OCaml for Scientists was surprisingly successful and became the inspiration for our 2006 book F# for Scientists and, more recently, Visual F# 2010 for Technical Computing. This is a kind of evolutionary algorithm because each new generation of books is derived from the previous generation and shares some of its "DNA" with it. This is a riskier approach with more variable results: we take a lot of risks and expect a significant proportion of our new products to fail, but every now and then we come up with a new product that ends up earning as much revenue as all of the others combined.

At this level of abstraction, computational algorithms can be used for all sorts of weird and wonderful applications. Doubtless we all use them without even realising it!