Sunday, 24 December 2017

Does reference counting really use less memory than tracing garbage collection? Swift vs OCaml

The way software developers cling to folklore and avoid proper experimental testing of hypotheses really bothers me. Perhaps the branch of computer science most riddled with myths and legends is memory management.
Despite over half a century of research on garbage collection showing that reference counting is inferior to tracing garbage collection algorithms (and that despite the fact that almost all GC research restricts its attention to Java, even though much better algorithms have been known for years), there are still many people who claim otherwise. C++ developers obviously still believe that reference-counted smart pointers are superior to tracing garbage collection, but now other people are doing it too. People who should know better.
This situation recently reared its ugly head again when Apple released a promising new programming language called Swift that shuns tracing garbage collection in favor of reference counting. Now, there are logical reasons for Apple to have chosen reference counting, most notably that their existing infrastructure uses it heavily. That I can tolerate. However, what I cannot tolerate is Apple putting lipstick on their pig by making technical claims about the superiority of reference counting over tracing garbage collection. Baseless claims.
Chris Lattner is a great guy and I have a lot of respect for him as the author of both LLVM and Swift but I'm amazed to see him make claims like:
"More pointedly, not relying on GC enables Swift to be used in domains that don’t want it - think boot loaders, kernels, real time systems like audio processing, etc.
...
One thing that I don’t think is debatable is that the heap compaction behavior of a GC (which is what provides the heap fragmentation win) is incredibly hostile for cache (because it cycles the entire memory space of the process) and performance predictability.
Given that GC’s use a lot more memory than ARC systems do, it isn’t clear what you mean by GC’s winning on heap fragmentation.
...
GC also has several *huge* disadvantages that are usually glossed over: while it is true that modern GC's can provide high performance, they can only do that when they are granted *much* more memory than the process is actually using. Generally, unless you give the GC 3-4x more memory than is needed, you’ll get thrashing and incredibly poor performance. Additionally, since the sweep pass touches almost all RAM in the process, they tend to be very power inefficient (leading to reduced battery life)."
Some of these beliefs are clearly wrong. For example, Docker and Mirage are both written in languages using tracing garbage collection (Go and OCaml, respectively).
Others are less obviously wrong. For example, claiming that GCs "cycle the entire memory space of the process" and that "the sweep pass touches almost all RAM in the process" is untrue, but it isn't immediately obvious why. The reason is that tracing garbage collectors only care about pointers. When presented with a block of binary data (e.g. an array of numbers) that doesn't contain any pointers, a tracing GC will not bother to trace it: none of that RAM gets touched. A tracing GC will only touch almost all RAM if the heap is composed almost entirely of pointers, for example if you had a traditional purely functional collection like a balanced binary tree so big that it consumed all available memory. Even on a mobile phone that is extremely unlikely to occur.
Chris' claim about heap fragmentation is also dubious at best. The primary source of defragmentation in a generational collector is copying survivors from the nursery into a mostly-contiguous part of an older generation. The nursery is deliberately sized so that this operation is expected to fit into CPU cache, so it is not "incredibly hostile for cache" at all. On the contrary, GC research has shown the cache friendliness of this operation to be one of the most important benefits of generational GCs.
Finally, some of these claims are surprising to me and I am unwilling to accept them as fact without first running some tests in an attempt to falsify them. Most notably the belief that tracing garbage collection requires 3-4x more memory than reference counting. Where did these numbers come from?!
Before we get into my findings perhaps I should explain my unwillingness to accept this information. Perhaps the most obvious problem is that, if the performance of a tracing GC is crippled by less than 3x memory overhead, why is the default memory overhead of OCaml's GC set to just 80%?
I've noticed that Apple aficionados often cite this paper by Hertz et al. in an attempt to justify their belief that tracing garbage collection requires 3-4x more memory than manual memory management in order to work efficiently. There are many problems with this. Firstly, extending the conclusion to reference counting requires you to believe that reference counting has no space overheads (when, in fact, it obviously needs to store the reference counts) and that it collects at the earliest possible point in a program's execution (when, in fact, it doesn't). Secondly, that paper silently ignores the production-quality JVM collector and replaces it with a variety of toy collectors, most of which use algorithms the likes of which haven't been seen in production for decades. Am I supposed to just believe that they did an expert job implementing their toy collectors? Where is their source code so I can reproduce these results for myself? Smells fishy. Finally, they give no justification for their choice of benchmark programs, they end up dropping three of the eight benchmarks they started with because they couldn't get their code to work properly, and some of the remaining benchmarks (e.g. a ray tracer) sound rather anomalous from a memory management perspective.
So I decided to test this hypothesis myself by benchmarking Swift against OCaml. Swift uses reference counting and OCaml uses tracing garbage collection. I am interested primarily in memory consumption but performance, compile times and code size will also be interesting. I thought about possible benchmark tasks. Given Chris' implication of high pointer density I thought it would be good to try some immutable collections. At first I thought of lists but then I figured trees would be more involved but still tractable in both languages. A common source of such collections is expression trees. Rather than writing an interpreter I thought it would be more interesting to write a program that computes the symbolic derivative of a mathematical expression. If we restrict consideration to integers, variables, addition, multiplication, raising to a power and logarithm then we can easily write a function that differentiates any such expression.
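To make the shape of the problem concrete, here is a minimal OCaml sketch of the kind of expression type and derivative function involved. The names are illustrative, and this omits the simplification logic and benchmark harness of the real programs:

  type expr =
    | Int of int
    | Var of string
    | Add of expr * expr
    | Mul of expr * expr
    | Pow of expr * expr
    | Ln of expr

  (* d/dx using the standard rules for sums, products, powers and logs. *)
  let rec d x = function
    | Int _ -> Int 0
    | Var y -> if y = x then Int 1 else Int 0
    | Add (f, g) -> Add (d x f, d x g)
    | Mul (f, g) -> Add (Mul (f, d x g), Mul (g, d x f))
    | Pow (f, g) ->
        (* (f^g)' = f^g * (g' ln f + g f'/f), with division written as
           multiplication by f^(-1) to stay within the grammar above. *)
        Mul (Pow (f, g),
             Add (Mul (d x g, Ln f),
                  Mul (g, Mul (d x f, Pow (f, Int (-1))))))
    | Ln f -> Mul (d x f, Pow (f, Int (-1)))

  (* Example: d/dx of x*x is x*1 + x*1 before simplification. *)
  let _ = d "x" (Mul (Var "x", Var "x"))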
Here is a solution in Swift and a solution in OCaml. The algorithms should be identical but clearly aren't optimal. I'm much more familiar with OCaml than Swift, so I reached out to the Swift community for code review and folded in one fix and some minor improvements, but nobody was able to optimise it further.
On a Raspberry Pi 3 running Swift 3.1.1 and OCaml 4.02.3 to compute up to the 9th derivative of x^x I get:
  • Swift requires 70% more chars of source code (3,939 vs 2,308 chars)
  • OCaml compiles over 3x faster than Swift with -O (1.7s vs 5.3s).
  • OCaml runs over 5x faster than Swift (1.469s vs 7.623s).
  • Swift consumes over 3x more resident memory (104MB vs 29MB).

It turns out the Swift code can be shrunk by prefixing function parameters with an underscore in order to avoid having to name all arguments at all call sites, so the code sizes aren't amazingly interesting, except perhaps that it is awesome to see more mainstream languages providing union types and pattern matching. The compile times are interesting: OCaml is very fast, so 3x slower than OCaml is actually pretty good; many languages are orders of magnitude slower than OCaml at compilation. The run times are very interesting to me because I have previously noted reference counting in C++ running 10x slower than OCaml, so I fully expected Swift to be slow at this, but exactly how slow I didn't know because Swift goes to great lengths to try to optimise away increments and decrements of reference counts. The memory consumption is really amazing to me: Swift uses over 3x more resident memory than OCaml. Wow. Not even close!
This cannot just be because every word-sized datum in the heap has been augmented with a word-sized reference count, because that could only account for a factor of two. I suspect OCaml's tracing garbage collector is collecting garbage earlier than Swift's reference counting is freeing it. I'll wager part of the problem is that Swift leaks in the presence of recursion due to the use of scope-based reference counting without tail call elimination. If so, that is an obvious optimisation for Chris to put into his Swift compiler. I'll leave the task of replacing my recursive nest function with an iterative looping one to test this theory as an exercise for the reader.
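For reference, this is roughly the shape of that change, sketched in OCaml for brevity (the name nest and its signature are illustrative; the measurement itself would of course have to be done against the Swift version). In Swift, without guaranteed tail calls, the recursive form keeps every intermediate value retained by a live stack frame until the whole recursion unwinds, whereas the loop only ever holds the current value:

  (* Recursive form: each frame retains its own argument until it returns. *)
  let rec nest n f x =
    if n = 0 then x else nest (n - 1) f (f x)

  (* Iterative form: only the current value is reachable at any time. *)
  let nest_iter n f x =
    let acc = ref x in
    for _i = 1 to n do acc := f !acc done;
    !acc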
Anyway, the conclusion seems clear: in at least this case reference counting requires more memory than tracing garbage collection. Therefore, it is unreasonable to make assertions like "tracing GC requires 3-4x more memory than reference counting to perform adequately".

Merry Christmas, y'all.

19 comments:

Yawar Amin said...

Hi Jon, I can't see a correlation between 'Swift uses more resident memory than OCaml' and 'ARC does not reduce memory consumption compared to a tracing GC'. Can you speak to why most of your benchmarked difference isn't just coming from confounding factors?

Unknown said...

Given that you open with a complaint that others "avoid proper experimental testing of hypotheses", I was wondering where your analysis would be. I love OCaml, but it and Swift are very different languages in more ways than just memory management :-)

-Chris

Chris Lattner said...

BTW, there is a post on reddit that claims that you're actually measuring the effect of OCaml's memoization:
https://www.reddit.com/r/programming/comments/7m1yc0/does_reference_counting_really_use_less_memory/

Yawar Amin said...

Hi Chris, that was me. I deleted the post after it was pointed out to me that this blog post isn't really measuring ARC vs GC performance. Thanks to Reddit's design, though, the permalink still works. I have downvoted it.

Jon Harrop said...

Guys, let me turn your questions around: how can anyone possibly do a "fair", "apples-to-apples" comparison that doesn't suffer from any such confounding factors? I could write this program in C++ using shared_ptr vs using a tracing GC, but that tracing GC would be crippled by C++'s inability to walk the stack efficiently, so it wouldn't be fair. I could write this benchmark so that both solutions used the same data representation, but that wouldn't be fair either because it would cripple the tracing solution with the unnecessary overhead required by the reference counting implementation. However you try to phrase a solution, I think using production-quality implementations for both is a very valuable data point because it shows what is possible from these two different approaches to garbage collection.

@Chris: Memoization is a red herring. Neither language is memoizing anything. OCaml might be doing a better job hoisting immutable constants like Int(0) but I've done it by hand and it makes no difference so that doesn't explain the observed difference. I've tried various other things too and cannot see anything other than the cost of reference counting vs tracing that might be responsible for the difference in behaviour. In point of fact, OCaml is an interesting point of comparison because it does almost no fancy optimisations at all and, yet, is still solving this problem using a lot less memory. Incidentally, if you increase the pressure on the GC to collect then OCaml's memory consumption goes down from 30kB to 6kB.
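For reference, "increasing the pressure on the GC" here just means lowering OCaml's space_overhead parameter so the major collector works harder and keeps the heap smaller, at some cost in CPU time. A minimal sketch (the value 20 is illustrative; the default is 80):

  (* Make the major GC more aggressive by lowering space_overhead.
     The same effect is available without recompiling via the
     OCAMLRUNPARAM environment variable, e.g. OCAMLRUNPARAM='o=20'. *)
  let () = Gc.set { (Gc.get ()) with Gc.space_overhead = 20 }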

Chris Lattner said...

I agree that having an "apples-to-apples" comparison is extremely difficult. That said, this is a single datapoint without much analysis or evidence to support the claims you make. Proper claims would be something along the lines of "this single ocaml program is better than this single swift program in these ways when built in this way". You go beyond that by extrapolating limited data to specific claims of root causes without analysis.

That said, this is your blog and I have no problem with you doing that - I'll just point out that it makes the first sentence of your post extremely ironic.

Yawar Amin said...

Jon, for an apples-to-apples comparison you need a compiler wherein you can optionally turn off the GC, and which also has a ready-made reference-counting type so you don't have to waste time rolling your own. D comes to mind as the obvious candidate.

yminsky said...

Chris's point that a single example program doesn't get you terribly far in the analysis seems fair. It's intriguing to see how this one example plays out, but more examples and more analysis would be required to tell a convincing story.

That said, I think Jon has a point that there's some value in comparing mature implementations, since the limitations of baby implementations can often be confused with limitations of the fundamental approach.

Chris, I'm curious how convinced you are that ARC really is better than an OCaml-style GC. The space overhead question is significant if true, but I'm not sure how true it really is. Similarly, arguments about determinism and the costs of compaction also leave me a little cold. Compaction is quite rare in ordinary OCaml programs, in my experience, so it's not clear that it should be a major driver of power use.

And the determinism story isn't entirely compelling either, since reference counting can lead to very large cascades of deallocation, which can potentially be quite expensive, whereas in an incremental GC, you can control the length of GC pauses pretty effectively, even using explicit timing information to decide how long to run for --- in other words, decoupling the collector from the mutator can give you more, rather than less control over pauses that are introduced by collection.

If you're interested, the following post outlines some work that we did to improve the latency of the GC, and to make it easier to control GC pauses interactively. I can imagine this kind of work being pretty effective in a context where one is building latency sensitive UIs, even though that's not the domain we've been thinking about it in.

https://blog.janestreet.com/building-a-lower-latency-gc/

Jon Harrop said...

@Chris: What kind of analysis and evidence would you like to see? Please bear in mind that I have only the most rudimentary Swift setup here. I just got this working on my x86 laptop as well and can repro the results there with Swift 4. I assume you can too?

@Yawar: "optionally turn off the GC". In all production tracing garbage collectors the entire system is designed and built around the GC and, consequently, the GC permeates everything. Consequently, you cannot extricate the GC from the rest of the system and, therefore, you cannot "turn off the GC". For example, if you use D then your results will only speak about poor GC implementations which won't be very interesting.

@Yaron: "Compaction is quite rare in ordinary OCaml programs, in my experience, so it's not clear that it should be a major driver of power use". IIRC, LablGTK actually disables compaction and nobody ever noticed any ill effects. I tried disabling compaction on my own OCaml code after reading about that and observed no changes. Would be interesting if you could disable it on some of your code too? That being said, my understanding is that heap fragmentation is very rarely an issue these days anyway. In fact, that is one of the reasons why I think non-moving GCs are an interesting alternative to conventional ones like OCaml's.

Yawar Amin said...

Jon, I agree it wouldn't be an interesting comparison. Unfortunately, it is probably the only apples-to-apples comparison you can make, if you extend your question to ARC vs (any) GC.

On a related note, I am starting to realise how astonishingly little scientific work has been done answering these major computing questions using evidence-based studies. In any other scientific field we have tons of replicated studies. In CS, maybe a few dozen (to be generous). Check out this podcast episode where they talk about this more: https://www.functionalgeekery.com/episode-55-andreas-stefik/

My point is that we as practitioners are kind of stumbling around in the dark, trying to pick up the pieces of evidence-based work that is missing from academia. It's a very weird situation.

Jon Harrop said...

@Yawar: I wouldn't consider that to be an apples-to-apples comparison. Indeed, my single biggest gripe with that paper is that they benchmarked only their own home grown GC implementations and neglected the production-quality one that was already there.

You're absolutely correct about the lack of empirical evidence employed when making decisions. I never expected this benchmark to be the be-all and end-all that finally put the RC/GC debate to rest, of course. I only ever wanted to force people into getting a reality check on some of these unsubstantiated claims that get bandied about. At the very least I think this casts reasonable doubt on the commonly-held assumption that RC is more memory efficient.

macias said...

@yminsky,

"the determinism story isn't entirely compelling either, since reference counting can lead to very large cascades of deallocation, which can potentially be quite expensive"

Expensive != deterministic. Something can be very expensive, yet 100% deterministic.

As a mere user of a programming language, I am really annoyed that I have to manually point out every moment at which I really, really would like to free some resources, because it is important for the logic of the program.

So M&S approach may be more efficient computationally, but it still leaves the code sprinkled with "free/dealloc/dispose/whatever".

So, the question is -- what is the total cost, i.e. developer time + debugging time + execution time?

Jon Harrop said...

@macias: "So M&S approach may be more efficient computationally, but it still leaves the code sprinkled with "free/dealloc/dispose/whatever"."

There is no reason for code to be littered with such things. If you look at F#, for example, such things are very rare. Indeed, I cannot recall the last time I had to free something manually.

In a functional language you just write a higher-order function that allocates a resource, applies a given function to use the resource and then frees the resource even in the exceptional case.
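For example, here is a minimal OCaml sketch of that idiom (the names with_resource, acquire and release are illustrative):

  (* Acquire a resource, apply f to it, and release it on both the normal
     and the exceptional path. *)
  let with_resource ~acquire ~release f =
    let r = acquire () in
    match f r with
    | result -> release r; result
    | exception e -> release r; raise e

  (* Usage: the file handle is closed on both normal and exceptional exit. *)
  let first_line path =
    with_resource ~acquire:(fun () -> open_in path) ~release:close_in input_line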

FĂ©lix said...

The current state-of-the-art GCs also need objects to have a header; AFAIK, we don't know how to make a concurrent tracing GC that doesn't have one.

On my machine, the Swift example suffers a great deal from its string management characteristics: strings are value types that implement copy-on-write semantics, the value type itself being 24 bytes large. If you replace String with UnicodeScalar for your Expr needs, the peak memory usage drops by about 50MB on my Mac. Accounting for the rest of the memory is an exercise left to the author, who should have done that diligence anyway if comparing memory management systems was a serious goal.

Also, except for intermediate strings when printing the result, all objects in the program stay live. Memory is never reclaimed, so this is really only exercising the allocation half of memory management, and inconclusively avoiding a lot of the purported downsides of GCs.

Jon Harrop said...

@Felix:

"The current state-of-the-art GCs also need objects to have a header; AFAIK, we don't know how to make a concurrent tracing GC that doesn't have one."

Yes. Objects have headers in both OCaml and F# here, of course.

"On my machine, the Swift example suffers a great deal from its string management characteristics: strings are value types that implement copy-on-write semantics, the value type itself being 24 bytes large. If you replace String with UnicodeScalar for your Expr needs, the peak memory usage drops by about 50MB on my Mac."

That is unfortunate. Symbols are represented efficiently in all other languages, of course. If I change the representation of variables in the Swift code from strings to ints to avoid that deficiency then I see a 32% reduction in resident memory but it is still using far more memory than either OCaml or F#.

"Also, except for intermediate strings when printing the result, all objects in the program stay live. Memory is never reclaimed, so this is really only exercising the allocation half of memory management, and inconclusively avoiding a lot of the purported downsides of GCs."

That is not true. If you run the GC stats in OCaml, for example, you see 96 collections of the nursery that collected 72% of all values that were allocated there because this benchmark generates a lot of garbage, of course. All intermediate expressions are garbage including the earlier derivatives and all temporaries when simplifications occur.
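For anyone who wants to check numbers like these for themselves, here is a minimal sketch of reading OCaml's GC counters (the field names come from the standard Gc module):

  let report_gc () =
    let s = Gc.stat () in
    Printf.printf "minor (nursery) collections: %d\n" s.Gc.minor_collections;
    Printf.printf "major collections: %d\n" s.Gc.major_collections;
    (* promoted_words / minor_words is the fraction of nursery allocations
       that survived; the rest was reclaimed as garbage by minor collections. *)
    Printf.printf "nursery survival: %.0f%%\n"
      (100. *. s.Gc.promoted_words /. s.Gc.minor_words)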

Are you saying that Swift fails to reclaim any of the garbage generated in this benchmark?

Chris Lattner said...

Random points:
- I have no opinion about the OCaml garbage collector, I know nothing about what tradeoffs it makes or anything else.
- sizeof(String) is planned to shrink at least to 16 bytes, and may go all the way to 8. (on 64-bit targets). This is planned for Swift 5 in Fall 2018.
- This specific benchmark uncovered many specific examples where Swift is doing a poor job optimizing recursive enums. It isn't particularly representative of general Swift code, but Swift should certainly do better.
- The issues uncovered really have nothing to do with the general claim about RC vs GC.

Thank you for the benchmark, it will certainly contribute to making Swift better for this sort of workload!

Jon Harrop said...

> I have no opinion about the OCaml garbage collector, I know nothing about what tradeoffs it makes or anything else.

FWIW, a couple of points about OCaml's GC: it is heavily optimised for symbolic code like this benchmark, because OCaml was built to write theorem provers, which do this all the time and are performance critical; and it does not allow mutator threads to run in parallel. These days OCaml has found new applications such as server-side programming in the finance industry, where it turns out this style makes concurrent programming for mathematical and logical code (e.g. trade matching) easy and efficient.

OCaml has also been used to write iOS apps, most notably the Facebook app. So it is a good yardstick to compare Swift to.

> - sizeof(String) is planned to shrink at least to 16 bytes, and may go all the way to 8. (on 64-bit targets). This is planned for Swift 5 in Fall 2018.

Excellent. BTW, I really like the way Swift lets you place indirections by hand. I wish more high-level languages did that.

> This specific benchmark uncovered many specific examples where Swift is doing a poor job optimizing recursive enums. It isn't particularly representative of general Swift code, but Swift should certainly do better.

Good to know. I hope it helps!

> The issues uncovered really have nothing to do with the general claim about RC vs GC.

Really? Do you think a future version of Swift will beat OCaml's memory consumption on this benchmark then?

> Thank you for the benchmark, it will certainly contribute to making Swift better for this sort of workload!

Not at all, thank you very much for making LLVM and Swift which are both awesome projects and all the best for the future! :-)

Yawar Amin said...

Jon,

> OCaml has also been used to write iOS apps, most notably the Facebook app.

While the former may be true, no one from Facebook has ever confirmed the latter.

Chris Lattner said...

Thanks Jon, happy holidays + new year!

-Chris