Saturday, 2 February 2013

Curious about the hash table problem

An old post from Stack Overflow:


In short, even with the fix in the latest GHC, Haskell is still incapable of providing a dictionary (mutable or immutable) that is competitively efficient.
Haskell's hash tables were 32× slower than alternatives like C++ and .NET with GHC 6.10. That was partly due to a performance bug in the GHC garbage collector that was fixed for GHC 6.12.2. But Simon Marlow's results there show only a 5× performance improvement which still leaves Haskell's hash tables many times slower than most alternatives.
Purely functional alternatives are also much slower than a decent hash table. For example, Haskell'sIntMap is 10× slower than .NET's hash table.
Using F# 2010 and the latest Haskell Platform 2010.2.0.0 (released yesterday!) with GHC 6.12.3 on this 2.0GHz E5405 Xeon running 32-bit Windows Vista to insert 20M int->int bindings into an empty hash table we find that Haskell is still 29× slower than F# in real time and over 200× slower in terms of CPU time because the Haskell burns all cores:
GHC 6.12.3 Data.HashTable: 42.8s (new!)
.NET hash table:            1.47s
Provided you run only short-lived microbenchmarks you can disable the GHC garbage collector as Don Stewart suggests above. By asking for a nursery generation so large that this particular program will never fill it, he brought the time for the Haskell hash table down to only 1.5s here. However, this completely undermines the whole point of having a nursery generation and will massively degrade the performance of other code because freshly allocated values will now always be cold in the cache (which is why the nursery generation is typically the size of the L2 cache, orders of magnitude smaller than this).

Tuesday, 27 November 2012

New group: Pragmatic functional programming research

I've started a new Google Group to discuss issues surrounding the practical use of former research around functional programming including issues such as designing purely functional data structures to leverage mainstream virtual machines (JVM and CLR) and garbage collectors, wrapping mainstream libraries (e.g. Windows Presentation Foundation) to improve productivity from modern functional languages and so on.
Please join us here.

Saturday, 10 November 2012

Trying to get F# working on the Raspberry Pi


The Raspberry Pi uses an older ARMv6 core which is not well supported by modern software. In particular, Mono's JIT apparently doesn't support the hardware-accelerated floating point instructions used by that version of the ARM instruction set. The solution is apparently to use the soft-float version of Raspian that emulates floating point instructions in software (which will be extremely slow). However, trying to compile the current version of the F# sources on Github using the stock Mono (2.10.8) fails with the following null reference exception from within the F# compiler:

error FS0193: internal error: Object reference not set to an instance of an object

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object
  at Microsoft.FSharp.Core.FSharpFunc`2[Microsoft.FSharp.Collections.FSharpList`1[System.String],System.String].InvokeFast[CcuThunk] (Microsoft.FSharp.Core.FSharpFunc`2 func, Microsoft.FSharp.Collections.FSharpList`1 arg1, System.String arg2) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Env.mkTcGlobals (Boolean compilingFslib, Microsoft.FSharp.Compiler.CcuThunk sysCcu, Microsoft.FSharp.Compiler.AbstractIL.ILGlobals ilg, Microsoft.FSharp.Compiler.CcuThunk fslibCcu, System.String directoryToResolveRelativePaths, Boolean mlCompatibility, Boolean using40environment, Boolean indirectCallArrayMethods, Boolean isInteractive, Microsoft.FSharp.Core.FSharpFunc`2 getTypeCcu) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Build+TcImports.BuildFrameworkTcImports (Microsoft.FSharp.Compiler.TcConfigProvider tcConfigP, Microsoft.FSharp.Collections.FSharpList`1 frameworkDLLs, Microsoft.FSharp.Collections.FSharpList`1 nonFrameworkDLLs) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Driver.getTcImportsFromCommandLine$cont@322 (Exiter exiter, Microsoft.FSharp.Core.FSharpOption`1 displayPSTypeProviderSecurityDialogBlockingUI, Microsoft.FSharp.Compiler.LexResourceManager lexResourceManager, Microsoft.FSharp.Collections.FSharpList`1 sourceFiles, System.String assemblyName, Microsoft.FSharp.Compiler.TcConfig tcConfig, Microsoft.FSharp.Compiler.ErrorLogger errorLogger, Microsoft.FSharp.Core.Unit unitVar) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Driver.getTcImportsFromCommandLine (Microsoft.FSharp.Core.FSharpOption`1 displayPSTypeProviderSecurityDialogBlockingUI, System.String[] argv, System.String defaultFSharpBinariesDir, System.String directoryBuildingFrom, Microsoft.FSharp.Core.FSharpOption`1 lcidFromCodePage, Microsoft.FSharp.Core.FSharpFunc`2 setProcessThreadLocals, Microsoft.FSharp.Core.FSharpFunc`2 displayBannerIfNeeded, Boolean optimizeForMemory, Exiter exiter, Microsoft.FSharp.Core.FSharpFunc`2 createErrorLogger) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Driver.main1 (System.String[] argv, Boolean bannerAlreadyPrinted, Exiter exiter, Microsoft.FSharp.Core.FSharpFunc`2 createErrorLogger) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Driver.mainCompile (System.String[] argv, Boolean bannerAlreadyPrinted, Exiter exiter, Microsoft.FSharp.Core.FSharpFunc`2 createErrorLogger) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.CommandLineMain+Driver.main (System.String[] argv) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.CommandLineMain.main (System.String[] argv) [0x00000] in <filename unknown>:0
[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object
  at Microsoft.FSharp.Core.FSharpFunc`2[Microsoft.FSharp.Collections.FSharpList`1[System.String],System.String].InvokeFast[CcuThunk] (Microsoft.FSharp.Core.FSharpFunc`2 func, Microsoft.FSharp.Collections.FSharpList`1 arg1, System.String arg2) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Env.mkTcGlobals (Boolean compilingFslib, Microsoft.FSharp.Compiler.CcuThunk sysCcu, Microsoft.FSharp.Compiler.AbstractIL.ILGlobals ilg, Microsoft.FSharp.Compiler.CcuThunk fslibCcu, System.String directoryToResolveRelativePaths, Boolean mlCompatibility, Boolean using40environment, Boolean indirectCallArrayMethods, Boolean isInteractive, Microsoft.FSharp.Core.FSharpFunc`2 getTypeCcu) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Build+TcImports.BuildFrameworkTcImports (Microsoft.FSharp.Compiler.TcConfigProvider tcConfigP, Microsoft.FSharp.Collections.FSharpList`1 frameworkDLLs, Microsoft.FSharp.Collections.FSharpList`1 nonFrameworkDLLs) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Driver.getTcImportsFromCommandLine$cont@322 (Exiter exiter, Microsoft.FSharp.Core.FSharpOption`1 displayPSTypeProviderSecurityDialogBlockingUI, Microsoft.FSharp.Compiler.LexResourceManager lexResourceManager, Microsoft.FSharp.Collections.FSharpList`1 sourceFiles, System.String assemblyName, Microsoft.FSharp.Compiler.TcConfig tcConfig, Microsoft.FSharp.Compiler.ErrorLogger errorLogger, Microsoft.FSharp.Core.Unit unitVar) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Driver.getTcImportsFromCommandLine (Microsoft.FSharp.Core.FSharpOption`1 displayPSTypeProviderSecurityDialogBlockingUI, System.String[] argv, System.String defaultFSharpBinariesDir, System.String directoryBuildingFrom, Microsoft.FSharp.Core.FSharpOption`1 lcidFromCodePage, Microsoft.FSharp.Core.FSharpFunc`2 setProcessThreadLocals, Microsoft.FSharp.Core.FSharpFunc`2 displayBannerIfNeeded, Boolean optimizeForMemory, Exiter exiter, Microsoft.FSharp.Core.FSharpFunc`2 createErrorLogger) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Driver.main1 (System.String[] argv, Boolean bannerAlreadyPrinted, Exiter exiter, Microsoft.FSharp.Core.FSharpFunc`2 createErrorLogger) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.Driver.mainCompile (System.String[] argv, Boolean bannerAlreadyPrinted, Exiter exiter, Microsoft.FSharp.Core.FSharpFunc`2 createErrorLogger) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.CommandLineMain+Driver.main (System.String[] argv) [0x00000] in <filename unknown>:0
  at Microsoft.FSharp.Compiler.CommandLineMain.main (System.String[] argv) [0x00000] in <filename unknown>:0

We'll keep trying...

Tuesday, 5 June 2012

Are functional languages inherently slow?

Functional languages require infrastructure that inevitably adds overheads over what can theoretically be attained using assembler by hand. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope.
Beware of self selection. C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program. Arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark.
The most obvious example of this is parallelism. Today, we all have multicores. Even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are trivial with purely functional data structures but can be a huge amount of work in imperative languages like C.
Functional languages will seem slower than C because you'll only ever see benchmarks comparing code that is easy enough to write well in C and you'll never see benchmarks comparing meatier tasks where functional languages start to excel.
However, you've correctly identified what is probably the single biggest bottleneck in functional languages today: their excessive allocation rates. Nice work!
The reasons why functional languages allocate so heavily can be split into historical and inherent reasons.
Historically, Lisp implementations have been doing a lot of boxing for 50 years now. This characteristic spread to many other languages which use Lisp-like intermediate representations. Over the years, language implementers have continually resorted to boxing as a quick fix for complications in language implementation. In object oriented languages, the default has been to always heap allocate every object even when it can obviously be stack allocated. The burden of efficiency was then pushed onto the garbage collector and a huge amount of effort has been put into building garbage collectors that can attain performance close to that of stack allocation, typically by using a bump-allocating nursery generation. I think that a lot more effort should be put into researching functional language designs that minimize boxing and garbage collector designs that are optimized for different requirements.
Generational garbage collectors are great for languages that heap allocate a lot because they can be almost as fast as stack allocation. But they add substantial overheads elsewhere. Today's programs are increasingly using data structures like queues (e.g. for concurrent programming) and these give pathological behaviour for generational garbage collectors. If the items in the queue outlive the first generation then they all get marked, then they all get copied ("evacuated"), then all of the references to their old locations get updated and then they become eligible for collection. This is about 3× slower than it needs to be (e.g. compared to C). Mark region collectors like Beltway (2002) and Immix (2008) have the potential to solve this problem because the nursery is replaced with a region that can either be collected as if it were a nursery or, if it contains mostly reachable values, it can be replaced with another region and left to age until it contains mostly unreachable values.
Despite the pre-existence of C++, the creators of Java made the mistake of adopting type erasure for generics, leading to unnecessary boxing. For example, I benchmarked a simple hash table running 17× faster on .NET than the JVM partly because .NET did not make this mistake (it uses reified generics) and also because .NET has value types. I actually blame Lisp for making Java slow.
All modern functional language implementations continue to box excessively. JVM-based languages like Clojure and Scala have little choice because the VM they target cannot even express value types. OCaml sheds type information early in its compilation process and resorts to tagged integers and boxing at run-time to handle polymorphism. Consequently, OCaml will often box individual floating point numbers and always boxes tuples. For example, a triple of bytes in OCaml is represented by a pointer (with an implicit 1-bit tag embedded in it that gets checked repeatedly at run-time) to a heap-allocated block with a 64 bit header and 192 bit body containing three tagged 63-bit integers (where the 3 tags are, again, repeatedly examined at run time!). This is clearly insane.
Some work has been done on unboxing optimizations in functional languages but it never really gained traction. For example, the MLton compiler for Standard ML was a whole-program optimizing compiler that did sophisticated unboxing optimizations. Sadly, it was before its time and the "long" compilation times (probably under 1s on a modern machine!) deterred people from using it.
The only major platform to have broken this trend is .NET but, amazingly, it appears to have been an accident. Despite having a Dictionary implementation very heavily optimized for keys and values that are of value types (because they are unboxed) Microsoft employees like Eric Lippert continue to claim that the important thing about value types is their pass-by-value semantics and not the performance characteristics that stem from their unboxed internal representation. Eric seems to have been proven wrong: more .NET developers seem to care about unboxing than pass-by-value. Indeed, most structs are immutable and, therefore, referentially transparent so there is no semantic difference between pass-by-value and pass-by-reference. Performance is visible and structs can offer massive performance improvements. The performance of structs even saved Stack Overflow and structs are used to avoid GC latency in commercial software like Rapid Addition's!
The other reason for heavy allocation by functional languages is inherent. Imperative data structures like hash tables use huge monolithic arrays internally. If these were persistent then the huge internal arrays would need to be copied every time an update was made. So purely functional data structures like balanced binary trees are fragmented into many little heap-allocated blocks in order to facilitate reuse from one version of the collection to the next.
Clojure uses a neat trick to alleviate this problem when collections like dictionaries are only written to during initialization and are then read from a lot. In this case, the initialization can use mutation to build the structure "behind the scenes". However, this does not help with incremental updates and the resulting collections are still substantially slower to read than their imperative equivalents. On the up-side, purely functional data structures offer persistence whereas imperative ones do not. However, few practical applications benefit from persistence in practice so this is often not advantageous. Hence the desire for impure functional languages where you can drop to imperative style effortlessly and reap the benefits.

Sunday, 6 May 2012

Learning garbage collection theory


I want to learn the theory behind garbage collection. How do i go about it?
I am also a dabbler interested in garbage collection (to the point that I wrote my own garbage collected VM called HLVM). I learned by reading as many research papers on garbage collection as I could get my hands on and by playing with the ideas myself, both raw in my virtual machine and also by writing memory-safe high-level simulations.
The obvious answer is - a compiler textbook... The question is, is it necessary to learn lexical analysis, parsing and other stuff that usually precedes garbage collection in a text?
The lexical analysis, parsing and other stuff is not relevant to garbage collection. You might get an outdated cursory overview of garbage collection from a compiler book but you need to read research papers to get an up-to-date view, e.g. with regard to multicore.
In short, what are the prerequisites to learning about Garbage collection theory?
You need to know about basic graph theory, pointers, stacks, threads and (if you're interested in multi-threading) low-level concurrency primitives such as locks.
Garbage collection is all about determining reachability. When a program can no longer obtain a reference to a value because that value has become unreachable then the GC can recycle the memory that value is occupying. Reachability is determined by traversing the heap starting from a set of "global roots" (global variables and pointers on the thread's stacks and in the core's registers)
GC design has many facets but you might begin with the four main garbage collection algorithms:
  • Mark-and-sweep (McCarthy, 1960)
  • Mark-and-compact (Haddon and Waite, 1967)
  • Stop-and-copy (Cheney, 1970)
  • Mark-region (McKinley et al., 2007)
Perhaps the most notable evolution of these basic ideas is generational garbage collection, which was the defacto standard design for many years.
My personal feeling is that some of the obscure work on garbage collection conveys just as much useful information so I'd also highly recommend:
You might also like to study the three kinds of write barrier (Dijkstra's, Steele's and Yuasa's) and look at the card marking and remembered set techniques.
Then you might also like to examine the actual design decisions some implementors chose for language implementations like Java and .NET as well as the SML/NJ compiler for Standard ML, the OCaml compiler, the Glasgow Haskell compiler and others. The differences between the techniques adopted are as big as the similarities between them!
There are also some great tangentially-related papers like Henderson's Accurate Garbage Collection in an Uncooperative Environment. I used that technique to avoid having to write a stack walker for HLVM.
The memorymanagement.org website is an invaluable resource, particularly the glossary of GC-related terms. Finally, the Garbage Collection Handbook by Jones et al. is a superb monograph on this subject.

Why do garbage collectors pause?


GCs work by tracing reachable heap blocks starting from a set of global roots (global variables, thread stacks and CPU registers). GCs lie on a sliding scale from snapshot to on-the-fly. Snapshot GCs work from a snapshot of the global roots and heap topology. On-the-fly GCs incrementally update their interpretation of the heap as the mutators run.
Wholly snapshot GCs attain high throughput because the collector runs almost entirely independently of the mutators but have high latency because taking a snapshot incurs a pause. Wholly on-the-fly GCs attain low latency because everything is done incrementally but low throughput because of the fine grained communication between the mutators and GC.
In practice, all GCs lie somewhere between these two extremes. VCGC is primarily a snapshot GC but it uses a write barrier to keep the collector apprised of changes to the heap topology. Staccato was the world's first parallel and concurrent and real-time GC but it still batches up some operations in order to retain the efficiency of stack allocation.

Functors vs generics


From this Stack Overflow question:
If I understand it correctly, functor gives you a new set of functions from a type you provide
More generally, functors map modules to modules. Your Set example maps a module adhering to theORDERED_TYPE signature to a module implementing a set. The ORDERED_TYPE signature requires a type and a comparison function. Therefore, your C# is not equivalent because it parameterizes the set only over the type and not over the comparison function. So your C# can only implement one set type for each element type whereas the functor can implement many set modules for each element module, e.g. in ascending and descending order.
Even if functors are just generics-for-namespace, what's the significant advantage of that approach?
One major advantage of a higher-order module system is the ability to gradually refine interfaces. In OOP, everything is public or private (or sometimes protected or internal etc.). With modules, you can gradually refine module signatures at will giving more public access closer to the internals of a module and abstracting more and more of it away as you get further from that part of the code. I find that to be a considerable benefit.
Two examples where higher-order module systems shine compared to OOP are parameterizing data structure implementations over each other and building extensible graph libraries. See the section on "Structural abstraction" in Chris Okasaki's PhD thesis for examples of data structures parameterized over other data structures, e.g. a functor that converts a queue into a catenable list. See OCaml's excellent ocamlgraph library and the paper Designing a Generic Graph Library using ML Functors for an example of extensible and reusable graph algorithms using functors.
Classes can also be used as ad-hoc namespaces using nested classes.
In C#, you cannot parameterize classes over other classes. In C++, you can do some things like inheriting from a class passed in via a template.
Also, you can curry functors...