Wednesday, 16 November 2011

What is your garbage collector collecting?

As a garbage collector runs, it recycles unreachable values. A well-known result in GC theory is that values tend to die in "clumps" rather than individually. This post visualizes these clumps.

To gather information about the structures being collected, we instrumented a C++ program equipped with our own mark-region collector (using 64kB regions) that is equivalent to the following F# implementation of a list-based n-queens solver:
// True if queens placed at (x1, y1) and (x2, y2) do not attack each other.
let safe x1 y1 x2 y2 =
  x1 <> x2 && y1 <> y2 &&
    x2 - x1 <> y2 - y1 && x1 - y2 <> x2 - y1

// All candidate positions on an n×n board.
let ps n =
  [ for i in 1 .. n do
      for j in 1 .. n do
        yield i, j ]

// Remove the positions attacked by a queen at (x0, y0), sharing the tail of the
// input list whenever nothing is removed from it.
let rec filter x0 y0 = function
  | [] -> []
  | (x1, y1 as q)::xys as list ->
      let xys' = filter x0 y0 xys
      if safe x0 y0 x1 y1 then
        if System.Object.ReferenceEquals(xys', xys) then list else
          q::xys'
      else
        xys'

// Count the solutions, either skipping or placing each candidate queen in turn.
let rec search n nqs qs ps a =
  match ps with
  | [] -> if nqs=n then a+1 else a
  | (x, y as q)::ps ->
      search n nqs qs ps a
      |> search n (nqs+1) (q::qs) (filter x y ps)

let solve n =
  search n 0 [] (ps n) 0

Stopping this program on its 1,000th GC cycle while it is solving the 11-queens problem, extracting the references between allocated but unmarked cons cells, keeping only the connected components with three or more vertices and visualizing the result using Mathematica's GraphPlot function gives the following:

In this case, we find that just 6% of the cons cells die alone (i.e. as singleton lists), 84% die in clumps of two cons cells, 6.5% die in clumps of three cons cells and 3.4% die in clumps of four or more cons cells. Only a handful of clumps had interesting structures.
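For readers who want to reproduce this kind of analysis, here is a minimal F# sketch (not the instrumentation used above) that computes the clump-size distribution from the references between dead cells, treating each dead cell as a vertex and each reference between two dead cells as an edge:

// Clumps are the connected components of the graph of dead cells, found here
// with a naive union-find (no path compression, for brevity).
let clumpSizes (numCells: int) (edges: (int * int) list) =
  let parent = Array.init numCells id
  let rec find i = if parent.[i] = i then i else find parent.[i]
  let union (i, j) = parent.[find i] <- find j
  List.iter union edges
  // Histogram: clump size -> number of clumps of that size.
  Array.init numCells find
  |> Array.countBy id
  |> Array.countBy snd
  |> Array.sortBy fst

// Example: cells 0, 1 and 2 form one clump of three and cell 3 dies alone.
clumpSizes 4 [(0, 1); (1, 2)] |> printfn "%A"   // [|(1, 1); (3, 1)|]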

Although these results suggest that it might be easy to obtain a speedup by adding a special case to the list type for lists containing two elements, we have found that this is not the case because the required changes also add expensive operations elsewhere, e.g. destructuring the special two-element list now incurs an allocation. Therefore, the next-best solution would be to optimize the garbage collector for this case, but it would be prudent to analyze the clusters collected during the running of other programs first. Analyzing the garbage generated by a program using red-black trees would be interesting...
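To make that trade-off concrete, the following is a hypothetical F# list type with a dedicated two-element constructor (the names are illustrative only); pattern matching it back into head/tail form shows where the extra allocation appears:

// A list type with a special case for exactly two elements.
type List2<'a> =
  | Nil
  | Cons of 'a * List2<'a>
  | Pair of 'a * 'a                        // special case: a two-element list

let uncons xs =
  match xs with
  | Nil -> None
  | Cons(x, rest) -> Some(x, rest)
  | Pair(x, y) -> Some(x, Cons(y, Nil))    // destructuring the special case allocates a fresh Cons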

Monday, 14 November 2011

The LMAX disruptor and Baker's Treadmill

A new retail financial trading platform called LMAX recently published its work on a new software technology for low-latency servers that it calls the "disruptor":

We have noticed a striking similarity between their disruptor and an old garbage collection algorithm from 1992 called Baker's Treadmill:

The LMAX disruptor and the story behind it are very interesting in their own right and well worth reading up on. In their system, binary messages come from the "receiver" and are distributed to the "journaller", "replicator" and "unmarshaller" in parallel, and the "unmarshaller" then passes the deserialized messages on to the "consumer". The core of their idea is to accomplish all of this message passing using a single shared data structure, the disruptor, rather than several separate concurrent queues.

Baker's Treadmill is a low-latency garbage collection algorithm. Allocated blocks in the heap are linked together to form a cyclic doubly-linked list that is divided into four segments by four iterators that chase each other around the ring as heap blocks are allocated, traced and freed. In addition to making all of the necessary operations incremental, this algorithm is interesting because it migrates heap blocks between its "to", "from", "new" and "free" spaces without physically copying them as so-called copying GC algorithms do (e.g. Cheney semi-space or the nursery in a generational GC). Although this kind of logical migration is very valuable, it is surprisingly uncommon in the GC research literature. The VCGC algorithm is another example of a GC algorithm that logically migrates heap blocks, in its case between "young", "old" and "dead" spaces, or "epochs" as the authors call them.
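The following F# is a structural sketch of the Treadmill's ring (our own simplified rendering, not Baker's full algorithm): cells form a cyclic doubly-linked list and are "moved" between segments purely by relinking, never by copying.

// A doubly-linked cell in the ring. AllowNullLiteral lets us default the links
// before the ring is stitched together.
[<AllowNullLiteral>]
type Cell(v: int) =
  member val Value = v with get, set
  member val Prev : Cell = null with get, set
  member val Next : Cell = null with get, set

// Create a one-cell ring that points at itself.
let singleton v =
  let c = Cell(v)
  c.Prev <- c
  c.Next <- c
  c

// Unlink a cell from wherever it sits in the ring...
let unlink (c: Cell) =
  c.Prev.Next <- c.Next
  c.Next.Prev <- c.Prev

// ...and relink it immediately after another cell. Logically migrating a block
// from one space to another is just unlink followed by linkAfter at a segment
// boundary; the block itself never moves in memory.
let linkAfter (anchor: Cell) (c: Cell) =
  c.Next <- anchor.Next
  c.Prev <- anchor
  anchor.Next.Prev <- c
  anchor.Next <- c

// The four iterators that divide the ring into the "free", "from", "to" and
// "new" segments would simply be four Cell references used as anchors here.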

Beyond the striking resemblance of the disruptor to the Treadmill, there are some important differences:

  • The disruptor is a thread-safe concurrent data structure whereas the Treadmill was designed for single-threaded use.
  • The disruptor was designed to flow data from producers to consumers whereas the Treadmill also requires the ability to move an element from one segment of the ring to another.

Despite these differences, we think it is interesting to observe the similarities between these two different data structures and to note that they were both designed for low latency. Given that Baker's Treadmill is now quite dated in terms of low latency GC algorithms, perhaps future work on low-latency data structures can borrow from more recent GC theory?


Saturday, 12 November 2011

Real garbage collector characteristics

The trade-offs between tracing and reference counting garbage collectors are nicely demonstrated by systems like F# and Mathematica. F# inherits garbage collection from .NET, which uses a conventional generational tracing garbage collector with three generations and a Large Object Heap (LOH). Mathematica uses reference counting, with language semantics that make it impossible to create cycles in the heap.

The following Mathematica program creates a balanced binary tree 25 levels deep that contains 2^25 branches, stores it in a variable and then mutates the variable back to the empty list:

Because Mathematica uses reference counting, resetting the variable to the empty list decrements the reference count of the root of our tree to zero, causing an avalanche in which the reference counts of the branches all the way down to the leaves are also decremented to zero, reclaiming all of the space consumed by the now-unreachable tree. Moreover, Mathematica is serial, so all of this work is done on the main thread, blocking everything else. Consequently, resetting the variable to the empty list takes a whopping 2.213s.

Surprisingly, although Mathematica has clearly done a lot of work that is presumably involved in releasing the space consumed by the tree, it does not actually return any memory to the OS and continues to consume almost half of the entire 32-bit address space on this machine!

This program may be written as follows in F#:

type t = Leaf | Branch of t * t

let rec deepTree = function
  | 0 -> Leaf
  | n -> Branch(deepTree(n-1), deepTree(n-1))

let mutable t = deepTree 25

t <- Leaf

System.GC.Collect()

In this case, creating the tree takes 10.2s, resetting the tree takes a fraction of a millisecond and explicitly invoking the garbage collector (which actually returns all of the memory back to the OS) takes 0.44s.

This highlights some of the trade-offs involved both in using systems that provide garbage collection as a user and in creating such systems as a developer.

Friday, 11 November 2011

Classifying garbage collection algorithms

Richard Jones' excellent new book about garbage collection has rekindled my fascination with this important subject. The Wikipedia page about GC is of disappointingly poor quality, so it is worth reviewing the main points here.

GC algorithms are categorized into four main families, with production garbage collectors often combining algorithms from different families. The families are copying, mark-sweep, mark-compact and reference counting. The first three are all tracing collectors that work by tracing all reachable values, starting from the global roots (global variables and locals held on the stack).

Copying collectors allocate into one space and then "evacuate" the reachable heap blocks (called "survivors") into another space before clearing the space they came from. The Cheney semi-space algorithm is the simplest copying collector: it uses two spaces and copies from one to the other and back again. The advantage of copying collection is that many heap-allocated blocks are deallocated simultaneously simply by resetting a pointer. This is ideal when only a small proportion of the allocated values survive a collection. As most values die young, the copying collection algorithm is commonly used for the nursery generation in a generational collector.
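As a concrete illustration, here is a minimal F# sketch of Cheney-style semi-space copying over a deliberately toy heap representation (an array of objects whose slots are either indices into the space or plain data); it is a rendering of the idea rather than production code:

// Toy heap object: a slot value >= 0 is a pointer (an index into the space),
// anything negative is plain data.
type Obj = { mutable slots: int array }

let copyCollect (fromSpace: Obj array) (toSpace: Obj array) (roots: int array) =
  let forward = System.Collections.Generic.Dictionary<int, int>()
  let mutable free = 0
  // Evacuate one object into to-space, recording a forwarding entry so that it
  // is only copied once, and return its new address.
  let evacuate addr =
    match forward.TryGetValue addr with
    | true, newAddr -> newAddr
    | _ ->
        let newAddr = free
        toSpace.[newAddr] <- { slots = Array.copy fromSpace.[addr].slots }
        forward.[addr] <- newAddr
        free <- free + 1
        newAddr
  // Copy the roots, then scan to-space left to right (Cheney's implicit queue),
  // evacuating everything the survivors reference and fixing up their pointers.
  let newRoots = Array.map evacuate roots
  let mutable scan = 0
  while scan < free do
    let o = toSpace.[scan]
    o.slots <- o.slots |> Array.map (fun s -> if s >= 0 then evacuate s else s)
    scan <- scan + 1
  // Everything left behind in from-space is garbage; 'free' is the new bump pointer.
  newRoots, free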

Mark-sweep is the oldest garbage collection algorithm (McCarthy, 1959) and works by tracing all reachable values and then deallocating all of the remaining unreachable values. This algorithm offers relatively high throughput and can be made incremental (low latency) using Dijkstra's tricolor marking scheme, but it is (arguably) prone to fragmentation. Consequently, the mark-sweep algorithm is commonly used for the old generation of generational collectors.
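A correspondingly minimal F# sketch of mark-sweep over a toy object graph (with an explicit mark flag in every node, purely for illustration):

type Node = { mutable marked: bool; mutable children: Node list }

// Mark: trace everything reachable from the roots.
let rec mark (n: Node) =
  if not n.marked then
    n.marked <- true
    List.iter mark n.children

// Sweep: everything allocated but unmarked is garbage; the survivors have their
// marks cleared ready for the next cycle.
let markSweep (allObjects: Node list) (roots: Node list) =
  List.iter mark roots
  let survivors = allObjects |> List.filter (fun n -> n.marked)
  survivors |> List.iter (fun n -> n.marked <- false)
  survivors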

Mark-compact traces reachable values and then slides them together. This avoids the problem of fragmentation that can (allegedly) afflict mark-sweep collectors, but the throughput of mark-compact collectors is apparently so poor that they are practically unheard of.

Reference counting works by having each value count the number of references to it. This requires the program to register locally held references by incrementing the reference count of the value referred to and then decrementing it again when the reference is no longer held. This is typically done by decrementing the reference count when the local reference falls out of scope in the source code, although liveness analysis can be exploited to find a tighter bound on when a reference is no longer needed. When a reference count is decremented to zero there are no longer any references to the value, so it is unreachable and can be deallocated. Interestingly, reference counting is the opposite of tracing because it works by pursuing dropped references rather than by following live references. The advantages of reference counting are its simplicity, its determinism in the context of single-threaded programs (multithreaded programs race to decrement and, therefore, deallocate non-deterministically) and the fact that it can easily be made incremental by maintaining a queue of reference counts that have yet to be decremented. The disadvantages are that it leaks cycles and has very poor throughput. However, cycles can be collected either by a backup tracing collector or by using a specific cycle collector such as the trial deletion algorithm, and throughput can be improved by deferring decrements (although this reintroduces the disadvantage of unpredictability).
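The mechanics, including deferred decrements, can be sketched in a few lines of F# (a toy model with invented names, not how any particular runtime implements it):

type RcObj = { mutable count: int; refs: RcObj list }

// Pending decrements are queued rather than applied immediately, which bounds
// the work done at any one time at the cost of delaying reclamation.
let pending = System.Collections.Generic.Queue<RcObj>()

let retain (o: RcObj) = o.count <- o.count + 1
let release (o: RcObj) = pending.Enqueue o

// Process at most 'budget' queued decrements; when an object's count reaches
// zero it is dead, so releases cascade to everything it references.
let rec processPending budget =
  if budget > 0 && pending.Count > 0 then
    let o = pending.Dequeue()
    o.count <- o.count - 1
    if o.count = 0 then List.iter release o.refs
    processPending (budget - 1)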

Latency is a big issue in the context of garbage collection so algorithms are also classified according to their latency characteristics. Stop-the-world algorithms are relatively simple but high latency, incurring arbitrarily-long pause times during which the mutator threads (those running the actual program) make no progress at all. The .NET 4 server GC is an example of a stop-the-world GC in production. GHC also uses a stop-the-world GC. The advantages of stop-the-world garbage collection algorithms are simplicity and high throughput. These GCs often incur pauses of the order of one second, which is totally unacceptable for many soft real-time applications such as visualization, games and (ironically!) servers.

Incremental GCs break up the work of a full GC and interleave the resulting bits of work in with the execution of the program. For example, OCaml's GC performs a "slice" of the work required to collect the old generation every time the young generation is collected. The advantages of incremental GCs are low latency compared to stop-the-world (e.g. ~10ms pauses with OCaml) and simplicity compared to concurrent GCs.

Today, the term "parallel GC" is used to describe a GC that uses multiple threads to speed up the collection of garbage. The .NET 4 server GC is an example of a parallel GC because it uses one thread per core to mark the heap in parallel. "Concurrent GC" means that the GC runs at the same time as the mutators; the .NET workstation GC is an example of a concurrent GC in production. Concurrent garbage collectors are further refined into on-the-fly and mostly-concurrent algorithms. An on-the-fly GC is defined by Jones et al. as one that never suspends more than one mutator at a time (i.e. there is no stop-the-world phase), although I suspect they meant that mutators are suspended independently of each other. Finally, a mostly-concurrent GC is one that does suspend all mutator threads simultaneously but only for a short period of time and not for an arbitrarily-long GC phase. For example, the .NET workstation GC pauses all mutator threads while the global roots are examined but runs concurrently with them during the marking and sweeping of the whole heap, so it is a mostly-concurrent garbage collector.

These classifications by family of algorithm and by latency characteristics are the most important ones.


Thursday, 3 November 2011

Applying optimization algorithms to profits

As a technology company, we like to apply technical solutions to problems at all levels. Our board of directors even apply technical solutions to the problem of company direction.

Business can be thought of as an optimization algorithm: tweaking the variables under our control in order to maximize company profits. Interestingly, we use a number of different kinds of optimization algorithm when dictating the direction of the company.

We begin new product lines based on experience but continue to optimize our products based on customer feedback, trying to solve the problems that are most important to our customers. For example, our F# for Numerics library started life as our second attempt at selling libraries to F# users (our first attempt was F# for Visualization) and we provided the features we thought would be most useful. Customers inevitably requested more features, including both technical features like parallel matrix inversion with arbitrary-precision rational arithmetic and non-technical features such as licences allowing them to bundle our library in their own commercial products. This approach of incrementally improving products is a kind of local optimization algorithm, like gradient descent. It is a low-risk approach that yields small improvements in profits.

Over the years, we have been commissioned to write many books and reports and have published many books ourselves. We often use content from one successful book as the inspiration for the next. For example, our 2004 book OCaml for Scientists was surprisingly successful and became the inspiration for our 2006 book F# for Scientists and, more recently, Visual F# 2010 for Technical Computing. This is a kind of evolutionary algorithm because the next generation of books is derived from the previous generation and shares some of its "DNA" with it. This is a riskier approach with variable results: we take a lot of risks and expect a significant proportion of our new products to fail, but every now and then we come up with a new product that ends up earning as much revenue as all of the others combined.

At this level of abstraction, computational algorithms can be used for all sorts of weird and wonderful applications. Doubtless we all use them without even realising it!


Wednesday, 7 September 2011

New book on garbage collection

After 15 years, Richard Jones has written a new monograph about garbage collection called The Garbage Collection Handbook. This supersedes his previous seminal book on the subject, Garbage Collection: Algorithms for Automatic Dynamic Memory Management.

In particular, this new book covers state-of-the-art techniques including parallel, incremental, concurrent and real-time garbage collection algorithms.

I'll post a review as soon as I've read my copy!


Sunday, 27 March 2011

Pros and cons of OCaml

The advantages and disadvantages of the OCaml programming language.

Pros:

  • More powerful type inference than any other language: OCaml even infers union and class types!

  • Powerful module system lets you parameterize modules over other modules easily and safely with full compile-time checking and minimal run-time overhead.

  • Structural typing of modules, polymorphic variants and classes improves brevity, closing the gap with dynamic languages, but can obfuscate error messages.

  • Powerful integrated macro system lets you alter the language's syntax on-the-fly and write parsers quickly and easily.

  • Good serial performance from the x64 code gen.

  • Easy-to-use REPL.

  • Lots of high-quality tools and libraries.

Cons:

  • No overloading can make numerical code over many different types (e.g. complex numbers, vectors and matrices) tedious.

  • Poor multicore support means poor performance on today's computers.

  • Some basic functionality missing (e.g. no 32-bit floats, no try..finally construct).

  • Some basic functionality is very inefficient (e.g. 32-bit ints).

  • Poor floating point performance from the x86 code gen.

  • 16MB limit on strings and arrays on 32-bit systems.

  • No way to override equality, comparison and hashing for new types and no way to catch errors created by this when the default structural versions are not applicable. Type classes are a solution to the former problem and equality types are a solution to the latter.

  • No value types means more memory use and worse performance in some important situations (e.g. filling a hash table with float keys and values in OCaml is 17× slower than F#).

Sunday, 13 March 2011

Alternatives to Numerical Recipes

The Jet Propulsion Laboratory at Nasa once hosted an interesting web page listing better alternatives to the infamous book Numerical Recipes. Here is a copy courtesy of The Wayback Machine:

There is no single alternative to Numerical Recipes. The authors of Numerical Recipes provide a superficial overview of a large amount of material in a small volume. In order to do so, they made many unfortunate compromises.

It is naïve to hope that every computational problem can be solved by a simple procedure that can be described in a few pages of chatty prose, and using a page or two of Fortran or C code. Today's ambitions for correctness, accuracy, precision, stability, "robustness", efficiency, etc. demand sophisticated codes developed by experts with deep understanding of their disciplines. We have long ago outgrown the capabilities of the simplistic approaches of 30 years ago.

Steve Sullivan has constructed a FAQ (Frequently Asked Questions) list on numerical analysis. The size of the list will give you some idea of the scope of the field. Some books are reviewed in section q165.

It would be unproductive to try to list all the excellent textbooks on numerical analysis. A few of them are (in alphabetical order of primary author):

  • Kendall Atkinson, Numerical Analysis.
  • Ward Cheney and David Kincaid, Numerical Mathematics and Computing, Brooks-Cole (Third edition, 1994). Aharon Naiman has prepared slides for a series of lectures he teaches at Jerusalem College of Technology using this text.
  • George Forsythe, Michael Malcolm and Cleve Moler, Computer Methods for Mathematical Computations, Prentice-Hall (1977). This book is probably out of print. It is a predecessor to the book by Kahaner et al., and written in a similar style. It is less comprehensive than the newer work.
  • Francis B. Hildebrand, Introduction to Numerical Analysis, McGraw-Hill (1956, 1974), Dover (1987 0-486-65363-3). This is one of the best books ever written on numerical analysis. It's out of date in some areas, most notably in Least Squares computation. Hildebrand has an engaging and transparent style of exposition, similar to Press et al. This book is, however, a mathematically sound reference to material of the same era as presented in much of Numerical Recipes.
  • David Kahaner, Cleve Moler and John Nash, Numerical Methods and Software, Prentice-Hall (1989). ISBN 0-13-627258-4. This book is probably closest in style to Numerical Recipes, but was written by practitioners in the field, rather than by experts in a different field.
    Diskette included.
  • John H. Mathews, Numerical Methods: for Mathematics, Science & Engineering, 2nd Ed., ISBN 0-13-624990-6 and 0-13-625047-5, Prentice Hall Inc. (1992). With purchase of this book free software is available from the following links:
  • John H. Mathews and Russell W. Howell,
    COMPLEX ANALYSIS: for Mathematics and Engineering, Third Edition, 1997, ISBN 0-7637-0270-6.
    With purchase of this book, free software is available.
  • Zdzislaw Meglicki has posted the text for a course on advanced scientific computing at Indiana University.
  • G. W. "Pete" Stewart posted the following in
    na-net:
    "I have recently published a book entitled `Afternotes on Numerical Analysis'... It is a series of 22 lectures on elementary numerical analysis. The notes themselves were prepared after the lectures were given and are an accurate snapshot of what went on in class. Although they are no substitute for a full-blown numerical analysis textbook, many people have found them a useful supplement to a first course. The book is published by SIAM. For further information contact service@siam.org." and on 2 Jan 1997:

    I have just completed a new set of afternotes and have posted them on the web. The original afternotes were based on an advanced undergraduate course taught at the University of Maryland. The present notes are based on the follow-up graduate course. The topics treated are approximation (discrete and continuous), linear and quadratic splines, eigensystems, and Krylov sequence methods. The notes conclude with two little lectures on classical iterative methods and nonlinear equations.

    The notes may be obtained by anonymous ftp at thales.cs.umd.edu in /pub/afternotes or by browsing my homepage
    http://www.cs.umd.edu/~stewart/.
    I will be grateful for any comments, corrections, or suggestions.

There are excellent texts and reference works that focus on narrow portions of the discipline of numerical analysis. Consider, for example:
  • Gene H. Golub and Charles F. Van Loan, Matrix Computations, Johns Hopkins (first edition 1983, second edition 1989, third edition 1996). ISBN 0-8018-5413-X (0-8018-5414-8 paper).
  • Ernst Hairer, Syvert Paul Nørsett, Gerhard Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems, Springer-Verlag (1987 3-540-17145-2 0-387-17145-2). A second edition has appeared but the ISBN's here are for the first.
  • Ernst Hairer, Gerhard Wanner, Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems, Springer-Verlag (1991: 3-540-53775-9 and 0-387-53775-9; 1996: 3-540-60452-9). This book and the previous one are highly regarded.
  • Charles L. Lawson and Richard J. Hanson, Solving Least Squares Problems, Prentice-Hall (first edition 1974), SIAM Press (second edition 1995) ISBN 0-89871-356-0.
  • Philip J. Davis and Philip Rabinowitz, Methods of Numerical Integration, Academic Press (second edition 1984) ISBN 0-12-206360-0.
  • Ingrid Daubechies, Ten Lectures on Wavelets, SIAM Press.
  • Jorge J. Moré and Stephen J. Wright, Optimization Software Guide, Frontiers in Applied Mathematics 14, Society for Industrial and Applied Mathematics (1993). About evenly divided between algorithms and software, both public-domain and commercial. (This book actually covers a fair amount of the content of Numerical Recipes, especially those parts that the authors of NR deemed too complex to do well.)
  • Spaeth, Mathematical Algorithms for Linear Regression (1987).
If you have been using Numerical Recipes for software, we recommend that you contact the computing professionals in your organization. For JPL users, you can contact the Computational Mathematics Subgroup, or obtain the Math77 and mathc90 libraries of mathematical software directly. There is also a substantial amount of software and information about software on-line.

For one-of-a-kind computations, we recommend MATLAB (The MathWorks, Inc.).

Wednesday, 23 February 2011

Scala's premature release

Back in August 2010, Martin Odersky raised eyebrows by stating that "Scala is foremost an industrial language". We commented on this triumph of hope over reality at the time, citing Scala's poor IDE support as a major headache for industrial users that had deterred us from adopting this academically-interesting programming language.

This rookie mistake of marketing a product months before it is ready is all too common. Several of our young client companies made the same mistake. One had even been paying an entire sales division not only salaries but bonuses for over a year before their first product was released!

The problem is not just the wasted marketing effort but that potential customers are deterred by their bad experiences with the immature product. This is reflected in Jonathan Edwards' recent article "Switching to Plan J", where he writes:

"My experiment with Scala is not working out. It’s just not ready for prime time..." - Jonathan Edwards

and others echo the sentiment:

"I’ve found exactly the same thing..." - Nat
"I agree with every point you made in this article... Tools are awful, even though years have been spent developing them." - Mark
"...plagued by continuous problems..." - Roy Batty

Martin Odersky now says that he is "painfully aware of the Eclipse issues" and Scala Solutions are working to produce reliable IDE support for Scala but he also admits that Emacs has been his IDE of choice for Scala development.

Hopefully the potential users who have been disappointed by today's Scala will try the language again once a first attempt has been made to address core issues like the reliability of the IDE support.


Wednesday, 9 February 2011

Size of industrial F# code bases

Almost exactly a year ago, we published a blog post stating that we had 345kLOC of production OCaml code and 171kLOC of production F# code. Today, we have 261kLOC of production F# code!

Sunday, 9 January 2011

Sweeping 700× faster

Our initial experiments using the new garbage collector design indicate that it dramatically improves the performance of the slowest GC phase (sweeping) as expected, to the extent that it brings the overall performance of our C++ prototype within 10% of OCaml without introducing any of the overheads of generational garbage collection:

Moreover, the algorithm is very simple and, in particular, parallel and concurrent variants should be much easier to design with this style of collector than with generational collectors because regions are a suitable granularity.

The sweep phase of the GC accounted for around a third of the total running time of the entire program using the traditional algorithm with allocated and free lists, with each sweep taking around 70µs. Using the new bitwise algorithm, the sweep phase takes just 100ns and accounts for just 0.7% of the total running time of the program.

Our prototype mark-region collector using a fake marking phase is now between 2% and 10% slower than OCaml without having sacrificed either its better performance on other benchmarks or its multicore capability. However, our new design has increased the cost of both allocation and marking. Although these accounted for a tiny proportion of the total running time, it remains to be seen whether or not this new GC algorithm will be faster than a traditional generational collector in this best-case scenario for generational collection when it is implemented in a real VM. In particular, the overheads of HLVM's current shadow stack may well make it slower.

Our collection strategy also captures the benefits of Appel's semi-generational collection algorithm. Appel's algorithm compacted nursery survivors to one end of the nursery and repeatedly used the remaining space as a smaller nursery until the entire nursery was full of reachable values, which were then promoted to the old generation. Our region-based collector can allow each thread to sweep its own region locally, independently of all other regions and cores. This also leads to the same non-linearity that makes Appel's algorithm so effective but without the overheads of copying. Using regions containing 62 blocks, each 64 bytes long, local sweeping increased the number of allocations performed before a new region was required from 62 to a whopping 3,000 allocations. Obtaining a new region requires synchronization, so this simple improvement has reduced the rate of synchronizations by around a factor of 50. With 8-byte regions, the 11-queens problem can be solved with a single region and, therefore, without any contention at all between threads.

Therefore, we believe our new GC design will allow us to make HLVM competitively performant on functional benchmarks like this without losing its substantial advantages on imperative code, where it is often several times faster than OCaml.


Saturday, 8 January 2011

The importance of locality and sparsity in memory management

Our previous articles describing the disadvantages of generational garbage collection and our prototype mark-region memory management system designed for HLVM originally showed that region-based allocation and deallocation has the potential to be only 4-20% slower than OCaml's generational collector. However, our more recent work, which was designed to be more realistic by deferring deallocations to bulk GC cycles, was significantly slower: around twice as slow as OCaml.

There are several differences between the stack-based deallocation scheme used in the first benchmark and the GC-like deallocation scheme used in the second benchmark that have the potential to account for this performance degradation:

  • The mock GC introduced mark bytes into the allocated values and marked them as unreachable when they fell out of scope in the mutator.
  • The mock GC allocated by popping a reference off the top of a region's free list.
  • The mock GC deallocated by pushing a reference onto the free list of the reference's region.
  • The mock GC added an "allocated list" that is used to record everything that has been allocated.
  • Upon deallocation, the mock GC removed a reference from the allocated list by overwriting it with the last element and reducing the length of the allocated list by one.

The overhead of the new "mark phase" that marks values as unreachable when they fall out of scope is minimal. The mark phase in HLVM's real GC accounts for less than 2% of the total running time of this benchmark and it does significantly more work (traversing the shadow stack) than the mark phase in this benchmark.

Using the free list as a stack would cause subsequent allocations to be contiguous in memory if and only if the free list happens to be ordered, i.e. allocations and deallocations are in LIFO order. This was the case in the first benchmark but not the second.

Upon collection, the allocated list was traversed sequentially. However, the way in which references were removed from the allocated list may well have been the critical design flaw. Specifically, moving the element from the back of the allocated list into the middle, to overwrite a removed element, changes the order of the list slightly. We suspected that this disorder would accumulate over time, destroying the locality of the references in the allocated list. Consequently, a sequential traversal of the allocated list is likely to have been of little benefit because subsequent elements of the allocated list would have referenced effectively random locations. Moreover, that disorder would have been passed on to the free lists, which would have seen values freed in random order rather than sequentially.
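The difference between the two removal strategies is easy to express in code. Here is an F# sketch (mirroring our simulation only in spirit) of removal by swapping with the last element versus an order-preserving sliding compaction, assuming the allocated list is held in the first 'count' entries of an array:

// 1. Swap-with-last: O(1) per removal but it reorders the list, gradually
//    destroying the allocation order and hence locality.
let removeAt (xs: 'a array) (count: int) i =
  xs.[i] <- xs.[count - 1]
  count - 1   // the new length

// 2. Order-preserving sliding compaction: slide the survivors down over the
//    dead entries in one pass, keeping them in allocation order.
let compact (xs: 'a array) (count: int) (isLive: 'a -> bool) =
  let mutable j = 0
  for i in 0 .. count - 1 do
    if isLive xs.[i] then
      xs.[j] <- xs.[i]
      j <- j + 1
  j   // the new length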

We had speculated that sparsity was largely responsible for the remaining performance gap between OCaml and all of the strategies based upon per-value deallocation because OCaml's generational GC is able to sweep runs of contiguously-allocated values from the nursery generation in constant time. On the basis of this, we predicted that a derivative of mark-region capable of deallocating contiguous runs of values from a region might get significantly closer to OCaml's performance on this benchmark. In particular, the performance profile of our prototype indicates that 18% of the total time is spent allocating and 17% is spent collecting. Furthermore, the L2 cache is around 50% slower than the L1 cache on this machine so the performance of the mutator might be 50% worse due to poor locality of reference in the second benchmark. These figures suggest that improving locality might double performance and, therefore, make our solution as fast as OCaml.

To answer some of our questions, we wrote a simulation of the prototype (!) in the F# programming language and used it to gather relevant statistics. This proved to be extremely successful and led to several major insights.

Firstly, values die over time so longer gaps between GC cycles means a higher proportion of unreachable values. The following graph illustrates the relationship between the number of allocations performed between collections and the proportion of values that remain reachable:

If GC cycles are separated by at least 300 allocations then more than half of the allocated values become unreachable and are swept. Therefore, we can optimize the allocated list operations under the assumption that most of the elements of the list will not survive a GC cycle.

Secondly, we found that the algorithm used to remove an element from the allocated list does indeed dominate the locality and, therefore, the performance of the entire program. The following graph illustrates the probability density of deallocations as a function of the length of the run of contiguously-allocated values that a value is in:

These results show that the original algorithm for removing references from the allocated list led to 45% of values being deallocated alone and significantly fewer being deallocated in contiguous runs. Therefore, it is clear that the original algorithm was indeed destroying locality when it reordered the references in the allocated lists.

In contrast, removing values from the allocated list using the order-preserving sliding compaction retained locality. In that case, only 0.02% of values were deallocated alone. In fact, the new algorithm is so good at preserving locality that values are more likely to be deallocated with a few neighbors than alone. Specifically, values are 4× more likely to be deallocated as part of a run of 23 contiguously-allocated values than they are to be deallocated alone.

These new results lend credence to our conjecture that exploiting sparsity by deallocating contiguous runs of values rather than individual values is the key to achieving performance comparable to that of a generational GC. However, we also now know that this will only be possible if the allocation-collection cycle preserves locality as much as possible.

The simplest way to preserve the order of allocations and exploit sparse deallocations is to side step the problem by changing the data structures involved:

  • Replace the free list with a bitvector.
  • Replace the mark bits in each value with a per-region mark bitvector.
  • Replace the allocated list with queues of full and non-full regions.

With 512 bits in a cache line, we can reserve the first cache line of each region to use as a bitvector for the entire region because we previously found that regions containing around this many values give near-optimal performance. Allocating from a region is then a matter of finding the first (un)set bit in the bitvector and the associated location, flipping the bit and returning the location. A contiguous sequence of values can be deallocated from a region by computing and applying a bitmask to the bitvector.

With regions conveying where allocated values are, there is no longer any need for an explicit allocated list. Therefore, the allocated list may be replaced with global queues of regions. When a local region is filled, it is enqueued on the global queue of full regions and a region is dequeued from the global queue of non-full regions, or freshly allocated if there are no non-full regions in that queue. When a GC cycle occurs, the global regions are dequeued, swept and enqueued again on the appropriate queue, i.e. if a full region becomes non-full then it changes queues.

Incredibly, sweeping a region is now as simple as applying bitwise operations to the allocated and marked bitvectors in order to remove unmarked locations from the allocated bitvector.
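A sketch of these bitvector operations, scaled down to regions of 64 values so that a single uint64 word stands in for the 512-bit cache line (the field and function names are ours, not HLVM's):

type Region = {
  mutable allocated: uint64   // bit i set => slot i is in use
  mutable marked: uint64      // bit i set => slot i was reached during marking
}

// Allocate: find the first clear bit in 'allocated', flip it and return the slot.
let tryAllocate (r: Region) =
  if r.allocated = System.UInt64.MaxValue then None else
    let mutable i = 0
    while r.allocated &&& (1UL <<< i) <> 0UL do i <- i + 1
    r.allocated <- r.allocated ||| (1UL <<< i)
    Some i

// Sweep: anything allocated but unmarked is garbage, so the whole region is
// swept with a single bitwise AND; the mark bits are then cleared for the next cycle.
let sweep (r: Region) =
  r.allocated <- r.allocated &&& r.marked
  r.marked <- 0UL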

With this new design, the operations that limited the performance of the old design will now be substantially faster and the locality of reference for the mutator threads will be greatly improved.

We are currently prototyping this new design. As we shall see in a future post, this new GC algorithm not only naturally lends itself to both parallelism and concurrency but is also almost a drop-in replacement for HLVM's current memory management subsystem.


Wednesday, 5 January 2011

Paul Graham's accumulator generator

Paul Graham once published an article about what he called "accumulator generators". This problem requires the existence of an unspecified numeric tower. Lisp happens to have one and it happens to be adequate for Paul Graham's examples.

You can implement a numeric tower in F# either using a union type (like type number = Int of int | Float of float) or by boxing everything. The following solution uses the latter approach:

let add (x: obj) (y: obj) =
  match x, y with
  | (:? int as m), (:? int as n) -> box(m+n)
  | (:? int as n), (:? float as x)
  | (:? float as x), (:? int as n) -> box(x + float n)
  | (:? float as x), (:? float as y) -> box(x + y)
  | _ -> failwith "Run-time type error"

let acc x =
  let x = ref x
  fun (y: obj) ->
    x := add !x y
    !x

let x : obj -> _ = acc(box 1)
do x(box 5) |> ignore
do acc(box 3) |> ignore
do printfn "%A" (x(box 2.3))   // prints 8.3

However, numeric towers are of little use in general-purpose programming and usually do more harm than good, so trying to learn from these kinds of challenges can be counterproductive. The real question is: why do we not want a numeric tower, do not want to box and do not want run-time type promotion?

In other words, why didn't we just write:

let x = 1
let x = x + 5
ignore 3
let x = float x + 2.3

We know the type of x at every step. Every number is stored unboxed. And we know that this code cannot produce a run-time type error.