Wednesday, 25 September 2013

How do reference counting and tracing garbage collection compare?

Reference counting works by counting the number of references to each object. When the reference count drops to zero the object is definitely unreachable and can be recycled. The first advantage is simplicity. The second advantage is that decrements can occur as locals fall out of scope so collection can be deterministic and predictable. The third advantage is that the asymptotic complexity of reference counting does not depend upon the total size of the heap. The first problem is that cycles keep reference counts above zero even when they are unreachable, so reference counting alone cannot handle the general case. The second problem with reference counting is that incrementing and decrementing a counter in an object every time a reference to it is copied or dropped is very computationally expensive because it incurs a cache miss even if nothing else in the object is read or written. The third problem is that multithreaded reference counting is non-deterministic because increments and decrements race. The fourth problem is that simple implementations like Boost's shared_ptr in C++ allow destructors to avalanche which incurs unbounded pause times.

The first problem is solved in some languages (e.g. Erlang, Mathematica) by imposing unidirectional heaps so cycles cannot be created by design and, therefore, reference counting is an accurate form of GC for them. The second problem can be addressed using a variety of techniques such as deferred decrements but they all add significant complexity and, therefore, undermine the first advantage. The third problem obviously negates the second advantage in the context of multithreaded programs and, in this era of multicore computing, most programs are multithreaded. The problem of avalanching destructors can be solved by adding a queue to defer destruction but this adds complexity and makes the solution unpredictable, i.e. negates both advantages.

Tracing garbage collection works by keeping track of references into the heap (the global roots) and then tracing through the heap to find unreachable objects eligible for collection. The first advantage of tracing collection is that it can be very simple (e.g. the mark-sweep collector in my HLVM project is ~20 lines of code). The second advantage is that it is always accurate (can collect cycles). The third advantages is that naive mark-sweep is much faster than naive reference counting and optimized tracing collectors (e.g. JVM and CLR) are significantly faster than the fastest reference counting collectors. The first disadvantage is that, although they can be deterministic (as OCaml is), they are unpredictable because it is not possible to predict when an object will be collected. The second disadvantage of naive tracing collectors is that mutator threads must all be paused for an unbounded amount of time during each collection cycle. The third disadvantage of tracing collection is that the time spent marking is proportional to the size of the entire heap.

The second disadvantage is easily reduced by adopting incremental mark-sweep and can be completely eliminated by more advanced techniques. The third disadvantage turns out to be practically irrelevant because the constant pre-factor is so small: even with the largest heaps in use today the mark phase is surprisingly quick.

There are some common myths surrounding this topic as well. Many people believe (incorrectly) that objects are always kept alive until the end of their scope in the source code and, therefore, that reference counting always collects at the earliest opportunity because it collects objects reachable from variables as they fall out of scope. In reality, this mental model is horribly broken. In reality, the GC acts at run-time when the code has been compiled so the concept of scope does not even exist. Not all locally-held references are spilled to the stack and even when a reference is spilled to the stack it is likely to be overwritten by another when it is no longer needed. Register allocators rely upon liveness analysis (not scope) to determine when references need to be kept.

Why is garbage collection an important feature?

  1. Garbage collection eliminates a class of bugs caused by erroneous memory management (forget to free, free too soon, free more than once).
  2. Garbage collection removes the need for APIs to describe contracts about memory management.
  3. Garbage collection facilitates programming styles such as first-class lexical closures in functional programming (see the Funarg problem).

What's new in garbage collection?

Since the 1950s there have been three main families of collectors: semi-space, mark-sweep and mark-compact. Almost all production GCs have been generational mark-sweep even though they exhibit pathological performance when the nursery is full of survivors because they are marked, (physically) copied and all references to them updated which is ~3x slower than necessary and is practically useful when filling a hash table with heap-allocated keys and/or values. The Beltway (2002) and Immix (2008) garbage collectors introduced the new family called the mark-region GCs. With a mark-region GC the entire heap is a collection of regions and so is the nursery generation so it can be logically aged by replacing it with another region when it is full of survivors. Sun's Hotspot JVM introduced the first mainstream mark-region GC with its G1 collector.

The advent of multicore in 2005 has meant more emphasis on parallel and concurrent garbage collectors. The Staccato (2008) garbage collector is the first that is simultaneously parallel and concurrent and real-time.

.NET represents a major lateral advancement among mainstream GCs because support for reified generics and value types allows .NET languages like C# and F# to express generic collections that can be filled with new values without requiring any heap allocation. For example, filling a Dictionary with ints, floats, complex numbers and low-dimensional vectors and matrices.

Saturday, 14 September 2013

A look at the IronJS source code

The author of IronJS has made some claims (e.g. F# can feel “like a slightly crappier version of C#”) that I find surprising so I'm reading his code. I find it to be quite odd for a variety of reasons:
  1. Surprisingly OOP (dozens of interfaces and classes and hundreds of augmentations, static members and overloads) given how well suited ML is to writing compilers.
  2. Lots of low-level bit-twiddling.
  3. Uses List.empty instead of [].
  4. Uses large tuples.
  5. Uses arrays of closures, e.g. Stmt : (Lexer.Token -> State -> Ast.Tree) array.
  6. Doesn't use a symbol table.
  7. Choice of data structures appears to incur lots of unnecessary boxing.
  8. Odd choice of data structures for memory representation, e.g. nullleft and stmt fields for a token are scattered.
  9. Doesn't seem to use concurrency or parallelism.
  10. Parser contains only 42 match expressions in 1,400 lines of code.
  11. Massive amounts of code duplication.
  12. Hand-rolled parser is unusual. Hand-rolled lexer without tables is even more unusual.
  13. Strange choice of syntax, e.g. lots of p |> csymbol instead of just csymbol p.
  14. Allegedly optimized but no structs.
  15. Uses HashSet.Add repeatedly instead of the declarative constructor from a seq.
  16. Doesn't use HashIdentity.Structural.
Furthermore, I don't recall ever seeing that author ask anyone for help with it and it looks like he could do with quite a bit of advice. So I'd take the "rewrite in C#" with a pinch of salt...
Here's an example of some code from the hand-rolled lexer that is using many comparisons of unicode characters when single table lookups could suffice:

let inline isAlpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let inline isDecimal c = c >= '0' && c <= '9'
let inline isOctal c = c >= '0' && c <= '7'
let inline isNonOctalDigit c = c = '8' && c = '9'
let inline isHex c = isDecimal c || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f')

Here's an example of some odd code that uses OOP methods to give names to the elements of a tuple when it should just use a record type instead of a tuple:

member x.TokenName =
  let s, _, _, _ = x.Token in s |> Lexer.Symbol.getName

member x.TokenValue =
  let _, v, _, _ = x.Token in v

member x.TokenLine =
  let _, _, l, _ = x.Token in l

member x.TokenColumn =
  let _, _, _, c = x.Token in c

I believe this is the offending type definition:

type Token = int * string * int * int

Here's another example of some strange code from the lexer where the programmer has manually assigned consecutive integers to symbols:

let [<Literal>] Break = 0
let [<Literal>] Case = 1
let [<Literal>] Catch = 2
let [<Literal>] Continue = 3
let [<Literal>] Default = 4
...

Note the complete absence of any metaprogramming here, which is surprising because it is written in a MetaLanguage (ML).

Here's another oddity, the use of new when constructing non-IDisposable objects:

new Dictionary<string, int>(

Yet another oddity, pretending that all type definitions are mutually recursive when they are actually completely independent:

and ScopeType
  = GlobalScope
  | FunctionScope
and EvalMode
  = Clean
  | Contains
  | Effected
and LookupMode
  = Static
  | Dynamic

Very odd to say that a Dictionary being faster than a Map is "sad" when it isn't even being mutated:

// Sadly a normal Dictionary is so
// much faster then a F# Map that
// it's worth using it
new Dictionary<string, int>(

In summary, although IronJS is written in F# it is not idiomatic F# code and there is a lot of room for improvement. I see no reason to believe that the problems with the IronJS code base have anything to do with the F# programming language.

Wednesday, 4 September 2013

Efficient Parallel Stencil Convolution in Haskell

Until recently, all research on parallel Haskell programs neglected CPU caches and existing high-performance software solutions like BLAS, LAPACK, FFTW and OpenCV. We criticized the paper Regular, shape-polymorphic, parallel arrays in Haskell for comparing parallel Haskell programs only with poorly-written C code, particularly because this buries the real performance differences in unnecessary cache misses.

In 2011, Ben Lippmeier, Gabriele Keller and Simon Peyton-Jones published a paper entitled Efficient Parallel Stencil Convolution in Haskell that is the first to address caches and locality in the context of parallel Haskell programs and to compare parallel Haskell programs to existing high-performance solutions. The paper discusses an algorithm for edge detection in computer vision that was 10x slower using the old REPA approach and walks through the design and implementation of a more advanced solution using new approaches that take CPU caches into account and, ultimately, attains performance comparable to the high-performance OpenCV library. We highly recommend this paper for anyone interested in the subject. Hopefully future work will progress to more advanced topics like Frigo’s trapezoidal space-time decomposition.

This appears to have marked a turning point in Haskell research. Several other contributors are now discussing locality and cache oblivious algorithms written declaratively in a purely functional style. Best of luck to them!

Tuesday, 3 September 2013

Hash table insertion performance: F# vs OCaml vs Haskell

Time taken to insert ten million int→T bindings for T as each of int, float, complex and string using F# on .NET 4.5, OCaml 3.12 and GHC 7.4.2:
On a 3.4GHz Intel Core i7-3770 running Windows 7.
Value types and reified generics facilitate .NET's very efficient Dictionary implementation when handling value types like int, float and complex (unmatched by any other language/VM including C, C++ and Java). F# is 50% faster than OCaml and Haskell using the reference type string perhaps because the .NET write barrier is significantly more efficient.
Note: we use "int" to refer to ordinary 32-bit ints in F# and Haskell but 31-bit ints in OCaml. Using 32-bit ints in OCaml would be much slower because they are boxed (i.e. heap allocated).