Tuesday, 5 June 2012

Are functional languages inherently slow?

Functional languages require infrastructure that inevitably adds overheads over what can theoretically be attained using assembler by hand. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope.
Beware of self selection. C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program. Arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark.
The most obvious example of this is parallelism. Today, we all have multicores. Even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are trivial with purely functional data structures but can be a huge amount of work in imperative languages like C.
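The undo-buffer point can be made concrete with a minimal sketch, in Python rather than the author's F# (the `Cons` class and `history` list here are illustrative names, not anything from the post): with an immutable linked list, "undo" is just keeping references to earlier versions, and every version shares its tail with its predecessors.

```python
class Cons:
    """An immutable singly linked list node."""
    __slots__ = ("head", "tail")
    def __init__(self, head, tail=None):
        self.head = head
        self.tail = tail

def to_list(cell):
    """Walk a Cons chain into a regular Python list (newest first)."""
    out = []
    while cell is not None:
        out.append(cell.head)
        cell = cell.tail
    return out

history = [None]                             # version 0: the empty document
for word in ["hello", "world", "!"]:
    history.append(Cons(word, history[-1]))  # each edit is O(1): one new cell

# Old versions remain intact because nothing is ever mutated;
# undo is simply stepping back one index in `history`.
assert to_list(history[3]) == ["!", "world", "hello"]
assert to_list(history[1]) == ["hello"]
assert history[3].tail is history[2]         # structural sharing, not copying
```

In an imperative language the same feature typically requires either deep-copying the document on every edit or maintaining an explicit command/inverse-command log.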
Functional languages will seem slower than C because you'll only ever see benchmarks comparing code that is easy enough to write well in C and you'll never see benchmarks comparing meatier tasks where functional languages start to excel.
However, you've correctly identified what is probably the single biggest bottleneck in functional languages today: their excessive allocation rates. Nice work!
The reasons why functional languages allocate so heavily can be split into historical and inherent reasons.
Historically, Lisp implementations have been doing a lot of boxing for 50 years now. This characteristic spread to many other languages which use Lisp-like intermediate representations. Over the years, language implementers have continually resorted to boxing as a quick fix for complications in language implementation. In object oriented languages, the default has been to always heap allocate every object even when it can obviously be stack allocated. The burden of efficiency was then pushed onto the garbage collector and a huge amount of effort has been put into building garbage collectors that can attain performance close to that of stack allocation, typically by using a bump-allocating nursery generation. I think that a lot more effort should be put into researching functional language designs that minimize boxing and garbage collector designs that are optimized for different requirements.
Generational garbage collectors are great for languages that heap allocate a lot because they can be almost as fast as stack allocation. But they add substantial overheads elsewhere. Today's programs are increasingly using data structures like queues (e.g. for concurrent programming) and these give pathological behaviour for generational garbage collectors. If the items in the queue outlive the first generation then they all get marked, then they all get copied ("evacuated"), then all of the references to their old locations get updated and then they become eligible for collection. This is about 3× slower than it needs to be (e.g. compared to C). Mark-region collectors like Beltway (2002) and Immix (2008) have the potential to solve this problem because the nursery is replaced with a region that can either be collected as if it were a nursery or, if it contains mostly reachable values, it can be replaced with another region and left to age until it contains mostly unreachable values.
Despite the pre-existence of C++, the creators of Java made the mistake of adopting type erasure for generics, leading to unnecessary boxing. For example, I benchmarked a simple hash table running 17× faster on .NET than the JVM partly because .NET did not make this mistake (it uses reified generics) and also because .NET has value types. I actually blame Lisp for making Java slow.
All modern functional language implementations continue to box excessively. JVM-based languages like Clojure and Scala have little choice because the VM they target cannot even express value types. OCaml sheds type information early in its compilation process and resorts to tagged integers and boxing at run-time to handle polymorphism. Consequently, OCaml will often box individual floating point numbers and always boxes tuples. For example, a triple of bytes in OCaml is represented by a pointer (with an implicit 1-bit tag embedded in it that gets checked repeatedly at run-time) to a heap-allocated block with a 64-bit header and a 192-bit body containing three tagged 63-bit integers (where the 3 tags are, again, repeatedly examined at run time!). This is clearly insane.
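The cost of boxing is easy to measure in any language that boxes numbers by default. A rough illustration in Python (which, like the OCaml example above, heap-allocates each float as a tagged object, whereas the standard-library `array` module stores raw unboxed 8-byte doubles): the figures below are CPython-specific, so treat the exact numbers as indicative only.

```python
import sys
from array import array

n = 10_000
boxed = [float(i) for i in range(n)]          # list of pointers to float objects
unboxed = array("d", (float(i) for i in range(n)))  # flat buffer of machine doubles

# For the list we must count both the pointer array and every boxed float;
# for array('d') the container itself holds the payload.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
unboxed_bytes = sys.getsizeof(unboxed)

# Each boxed float carries an object header on top of its 8-byte payload,
# so the boxed representation is several times larger.
assert boxed_bytes > 2 * unboxed_bytes
```

The same gap shows up as extra allocator and garbage-collector traffic, not just as wasted space, which is exactly the bottleneck discussed above.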
Some work has been done on unboxing optimizations in functional languages but it never really gained traction. For example, the MLton compiler for Standard ML was a whole-program optimizing compiler that did sophisticated unboxing optimizations. Sadly, it was before its time and the "long" compilation times (probably under 1s on a modern machine!) deterred people from using it.
The only major platform to have broken this trend is .NET but, amazingly, it appears to have been an accident. Despite shipping a Dictionary implementation very heavily optimized for keys and values that are value types (because they are unboxed), Microsoft employees like Eric Lippert continue to claim that the important thing about value types is their pass-by-value semantics and not the performance characteristics that stem from their unboxed internal representation. Eric seems to have been proven wrong: more .NET developers seem to care about unboxing than pass-by-value. Indeed, most structs are immutable and, therefore, referentially transparent, so there is no semantic difference between pass-by-value and pass-by-reference. Performance is visible, and structs can offer massive performance improvements. The performance of structs even saved Stack Overflow, and structs are used to avoid GC latency in commercial software like Rapid Addition's!
The other reason for heavy allocation by functional languages is inherent. Imperative data structures like hash tables use huge monolithic arrays internally. If these were persistent then the huge internal arrays would need to be copied every time an update was made. So purely functional data structures like balanced binary trees are fragmented into many little heap-allocated blocks in order to facilitate reuse from one version of the collection to the next.
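The trade-off described above can be sketched with "path copying", the standard technique behind persistent trees. This is a minimal Python illustration (an unbalanced binary search tree for brevity, where real implementations balance): inserting into an immutable tree allocates one small node per level of the search path and shares every other subtree with the previous version.

```python
class Node:
    """An immutable binary search tree node."""
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(node, key):
    """Return a NEW tree containing key; the old tree is untouched."""
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node  # key already present: reuse the whole subtree

t1 = None
for k in [5, 2, 8, 1, 9]:
    t1 = insert(t1, k)
t2 = insert(t1, 3)            # a new version; t1 still exists (persistence)

assert t2.right is t1.right   # the subtree rooted at 8 is shared, not copied
```

Only the path 5 → 2 is copied to make `t2`, which is why each update is cheap, but also why the collection ends up fragmented into many little heap-allocated blocks instead of one monolithic array.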
Clojure uses a neat trick to alleviate this problem when collections like dictionaries are only written to during initialization and are then read from a lot. In this case, the initialization can use mutation to build the structure "behind the scenes". However, this does not help with incremental updates, and the resulting collections are still substantially slower to read than their imperative equivalents. On the upside, purely functional data structures offer persistence whereas imperative ones do not. However, few practical applications benefit from persistence in practice, so this is often not advantageous. Hence the desire for impure functional languages where you can drop into imperative style effortlessly and reap the benefits.
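The shape of that Clojure trick can be sketched in Python (this mimics the build-then-freeze pattern of Clojure's transients; it is not Clojure's actual implementation): build the collection with cheap mutation inside one function, then hand out only a read-only view, so callers still see an effectively immutable collection.

```python
from types import MappingProxyType

def build_table(pairs):
    """Build with mutation 'behind the scenes', then freeze."""
    d = {}                      # mutable only during initialization
    for k, v in pairs:
        d[k] = v
    return MappingProxyType(d)  # read-only view: no further writes possible

table = build_table([("a", 1), ("b", 2)])
assert table["a"] == 1

try:
    table["c"] = 3              # any attempted update fails
except TypeError:
    pass
else:
    raise AssertionError("read-only view should reject writes")
```

As in Clojure, this gives fast initialization without giving up immutability at the interface, but it does nothing for the incremental-update case discussed above.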


Daniel Robinson said...

Great article. Thanks! Is it possible to achieve the performance of imperative languages through compiler optimizations? The chief benefit of functional programming is declarativeness. It seems compiler techniques such as TCO (tail-call optimization) could be extended to transform even more code into imperative constructs, yielding the best of both worlds.

Tezka said...

+1 on a nice post. I really agree with you; MLton is a gem with lots of nice features. I really hope it sees a revival of interest.

Jon Harrop said...

> Is it possible to achieve the performance of imperative languages through compiler optimizations?

In theory, compilers of purely functional languages could spot when a collection does not require persistence and swap it for a more efficient imperative collection (e.g. replace a hash trie with a hash table). In practice, that would be too unpredictable to be useful and the replacement might not always be faster.

Linear types could be used to convey to the compiler that there is only ever one reference to a value and, therefore, it can be modified in-place. Less declarative but more predictable.

In practice, I'm not sure the benefits outweigh the costs...

Jules said...

Stalin Scheme is also worth mentioning. It is really fast but unfortunately quite old by now, and the ideas are largely undocumented except in code.

Ananth Balasubramaniam said...

Excellent article. It looks like functional languages on .NET/Mono like F# or Nemerle would straddle the best of both worlds. IMO, Mono is a great candidate to add features and try out stuff on because of its open-source nature and overall solid code.

Lawson English said...

As I understand it (corrections welcome), method temp variables in Squeak Smalltalk are also stack-allocated rather than heap-allocated objects.

Paul McJones said...

You start off strong, "If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program." but then switch to a bunch of "what-ifs" -- supposedly better garbage collectors, unboxing algorithms, and approaches to data structures. To be realistic, you should admit that today's operating systems, database managers, scripting languages, desktop applications, etc., are written in C or C++. Here's a nice summary table: The Programming Languages Beacon. Can you give a similar list of applications written in a functional language?

simplemath said...

The .NET garbage collector was originally written in Lisp and then automatically translated to C.

Paul Stadig said...

Excellent read. I enjoyed your perspective, and agreed with many of your points.

A couple of nits:

"...and the resulting collections are still substantially slower to read than their imperative equivalents."

This is not true for Clojure's persistent vectors. Clojure's persistent vectors offer constant-time access and are cache friendly. One of the goals in the design of the persistent data structures was to be competitive with mutable arrays by maintaining the O(1) access time. They do this, and also add an O(1) conj operation to insert at the end.

"However, few practical applications benefit from persistence in practice so this is often not advantageous."

This statement might be true on the face of it... I'm not sure. In the case of Clojure, it combines persistent data structures with a paradigm for managing state, which includes reference types and a heavily trafficked idea of snapshot reads (made possible and efficient by persistent data structures). I know other languages like ML have reference types, but I don't know ML well enough to know whether it has such a pervasive and consistent paradigm for managing state. I find Clojure and its persistent data structures to be very practical, and we have used it to build large distributed systems.

Louis Wasserman said...

Modern array computations with Haskell are really impressively fast, largely because the authors of modern Haskell array libraries have developed optimization techniques that allow code written with traditional functional idioms (map, filter, scan, etc.) to get compiled down to much more C-like code that does almost no intermediate allocation.

The technique is really impressive, not least because it only exploits "traditional" Haskell optimizations not specialized for array computations at all. And once it hooks into the new LLVM backend for GHC, it's _damn_ impressive. (http://donsbot.wordpress.com/2010/02/21/smoking-fast-haskell-code-using-ghcs-new-llvm-codegen/ has some details, but that was in early 2010.)

Klapaucius said...

>probably under 1s on a modern machine!

No, not really:

Core i7 860

Klapaucius@CORE /cygdrive/d/tests
$ time mlton.bat
MLton 20100608 (built Fri Jun 11 12:02:32 GMT 2010 on APPLE)

real 0m0.070s
user 0m0.000s
sys 0m0.015s

Klapaucius@CORE /cygdrive/d/tests
$ cat hello.sml
print "hello";
Klapaucius@CORE /cygdrive/d/tests
$ time mlton.bat hello.sml

real 0m2.972s
user 0m0.015s
sys 0m0.000s

jackfoxy said...

This is a very timely article for my understanding, since I recently began a project to measure data structure performance in F#.

foobie42 said...

Regarding pass-by-value stuff in .NET - when creating a generic function, make sure to include the "when 'a : struct" constraint. Otherwise boxing ensues.

-yet another Common Lisp hacker

foobie42 said...

One more thing, Toady. You were right. Static typing is good.

BTW, I'm writing an ML-inspired language that allows the occurs check to fail and even supports circular types. There's no shameless plug here, though.

But also Kenny was/is right. Have you had positive or negative experience with dataflow? I once implemented a dataflow widget toolkit on top of XNA with F#. Can you cover dataflow too? It seems aligned with functional programming, where side effects can be greatly abstracted away.
