Monday, 23 November 2015

C++ vs C# performance [deleted]

The following answer to a question about C++ vs C# performance on Stack Overflow has sadly been deleted despite having 305 upvotes:


I often heard that people prefer C++ to C# mainly in performance-critical code, because the GC might turn up on the critical path, causing a performance penalty.

I have heard that in some circles but never respectable circles.

For example, I consulted for a company in London who were selling stock exchange software that had been written in 1,000,000 lines of C++. Over 40 developers had been working on it for almost 15 years and they were convinced that C++ was the correct solution for such software because latency and throughput performance were both critical. They were achieving latencies as low as 50ms (with a single trader connected!) and throughput as high as 10k trades per second (tps). However, they were struggling to support more than 2,000 traders because they had several threads per trader (no async) and, in fact, traders were reporting latencies as high as six seconds because the latency of their C++ code increased exponentially with the number of traders. I rewrote their code in 3 months using F# on .NET and achieved latencies as low as 0.1ms and throughputs over 200ktps using just 6,000 lines of F#. My solution was fully asynchronous (supported over 10,000 simultaneous trader connections) and fault tolerant.

Now, I'm not saying that C++ could not have been used to achieve even better performance than mine. On the contrary, I'm sure it could have achieved better performance but I also believe it would have taken man-decades of work by real experts and cost millions of pounds. After all, there's a reason why the London Stock Exchange paid £18m for MillenniumIT and their low-latency C++ solution. However, I do believe that the vast majority of the people who prematurely optimize away garbage collection don't know what they are talking about and would not be capable of building a good solution in any language. Such people usually only know C++ and have no knowledge of garbage collection algorithms, which is scary because C++ programmers reinvent GC algorithms every day. A good test is to ask them how garbage collection works. If they describe naive mark-sweep circa 1960 then they haven't done their homework.

On the other hand, some people write excellent low-latency and high-throughput code in garbage collected languages. For example, see the LMAX Disruptor (Java) and Rapid Addition FIX engine (C#). So people have written low-latency software in Java and C# and, therefore, it clearly is possible. In particular, the use of arrays of value types is a known but under-appreciated solution for low-latency programming on .NET.

However, when I read about C++, I realized that C++ offers smart pointers, with which the programmer does not need to worry about memory management. For example, shared_ptr with reference counting will manage the memory for us. Hence, we do not really care about the lifetime of an object or when it is deleted. Wouldn't that be similar to the C# GC, in that the destructor of the object could be called in performance-critical code?

Yes. C++ programmers often complain about tracing garbage collectors being non-deterministic and causing pauses. Thread-safe shared_ptr in C++ is non-deterministic because threads race to decrement the count to zero and the winner of the race condition is burdened with calling the destructor. And shared_ptr causes pauses when decrements avalanche, e.g. when a thread releases the last reference to a tree the thread is paused for an unbounded length of time while every destructor in the tree is called. Reference counting can be made incremental by queuing destructors but that recovers the non-determinism of tracing garbage collection. Finally, reference counting with shared_ptr is several times slower than tracing garbage collection because incrementing and decrementing counts is cache unfriendly.
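The avalanche of decrements can be observed directly. The following is a minimal sketch, assuming a hypothetical Node type and a global counter that exist only for illustration: dropping the last reference to a tree runs every destructor in that tree on the releasing thread before reset() returns.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical counter for illustration: how many Node destructors have run.
std::size_t g_destroyed = 0;

// Hypothetical tree node that owns its children through shared_ptr.
struct Node {
    std::shared_ptr<Node> left, right;
    ~Node() { ++g_destroyed; }
};

// Build a complete binary tree of the given depth (depth 0 is the empty tree).
std::shared_ptr<Node> build(int depth) {
    if (depth == 0) return nullptr;
    auto n = std::make_shared<Node>();
    n->left = build(depth - 1);
    n->right = build(depth - 1);
    return n;
}

// Release the last reference to a tree and report how many destructors
// ran synchronously on the calling thread as a result.
std::size_t destructors_run_by_one_reset(int depth) {
    auto root = build(depth);
    const std::size_t before = g_destroyed;
    root.reset();  // one decrement to zero cascades through the whole tree
    return g_destroyed - before;
}
```

The work done by that single reset() grows with the size of the structure being released, which is why the pause has no fixed bound. Note that the cascade is also recursive, so a very deep structure (for example a long linked list) can exhaust the stack.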

On a related note, C++ programmers often mistakenly claim that shared_ptr collects garbage at the earliest possible point in the program and, therefore, collects more "promptly" than a tracing garbage collector can. In fact, scope-based reference counting like shared_ptr keeps floating garbage around until it falls out of scope, which increases register pressure and can even increase memory consumption compared to tracing garbage collection.
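As a rough illustration of floating garbage, the sketch below uses a weak_ptr to watch an object after its last use. Buffer, Observation and observe_floating_garbage are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical payload type for illustration.
struct Buffer {
    std::vector<int> data;
    explicit Buffer(std::size_t n) : data(n, 1) {}
};

struct Observation {
    bool alive_after_last_use;  // still allocated although never used again
    bool alive_after_scope;     // still allocated once the scope has ended
};

Observation observe_floating_garbage() {
    std::weak_ptr<Buffer> watcher;  // observes the object without owning it
    Observation obs{};
    {
        auto buf = std::make_shared<Buffer>(1000);
        watcher = buf;
        long sum = 0;
        for (int x : buf->data) sum += x;  // last use of buf
        (void)sum;
        // Imagine long-running work here that never touches buf again.
        obs.alive_after_last_use = !watcher.expired();
    }  // only here does the count reach zero and the Buffer get destroyed
    obs.alive_after_scope = !watcher.expired();
    return obs;
}
```

The buffer survives until the closing brace even though nothing reads it after the loop. A tracing collector that works from liveness information is, in principle, free to reclaim it as soon as the loop finishes; whether a given runtime actually does so depends on its implementation.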

So shared_ptr is indeed nothing more than a poor man's garbage collector. After all, old JVMs and CLRs both used reference counting at some point in history and both dropped it in favor of better forms of garbage collection. Reference counting is only popular in C++ because there is no easy way to walk the stack and redirect pointers so accurate tracing collection is prohibitively difficult.

Another question: if we don't use smart pointers in C++ and just resort to raw pointers, we still need to call delete to free the heap memory. So, from my understanding, every object created in C++ or C# would still be destroyed; the only difference is that we manage the memory ourselves in C++ whereas in C# we let the GC manage it. So what is the net effect when comparing C++ and C#, since objects still need to be deleted in both?

In its simplest form, allocation in C++ boils down to calling a general-purpose shared (global) memory allocator like malloc and in C# it boils down to bump-pointer allocation into a thread-local nursery generation (gen0). Consequently, ordinary allocation in C# is much faster than ordinary allocation in C++. However, that misrepresents real software. In practice, C++ programmers avoid calling the general-purpose global allocator in favor of using thread-local pool allocators whenever possible. On the other hand, C# developers rely on the general-purpose memory management solution provided by .NET because it greatly simplifies APIs (memory ownership has been abstracted away) and is more than fast enough in the vast majority of cases. In the few cases where the simple solution is not adequate, the C# developer drops to lower-level C# code and writes a pool allocator using an array of value types.
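A pool allocator over an array of value types can be sketched in a few lines. The version below is written in C++ and is only an illustration of the idea, not code from the system described earlier; Order, OrderPool and kNone are hypothetical names. Slots live in one contiguous array, are referred to by index rather than by pointer, and free slots are chained through the array itself, so allocating and releasing never touch the general-purpose allocator after construction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sentinel index meaning "no slot" (hypothetical, for illustration).
constexpr std::int32_t kNone = -1;

// Plain value type stored inline in the pool's array.
struct Order {
    std::int64_t price;
    std::int32_t quantity;
    std::int32_t next_free;  // index of the next free slot while unused
};

class OrderPool {
public:
    explicit OrderPool(std::int32_t capacity)
        : slots_(static_cast<std::size_t>(capacity)),
          free_head_(capacity > 0 ? 0 : kNone) {
        // Thread every slot onto the free list: 0 -> 1 -> ... -> kNone.
        for (std::int32_t i = 0; i < capacity; ++i)
            slots_[i].next_free = (i + 1 < capacity) ? i + 1 : kNone;
    }

    // Pop a slot off the free list; returns kNone when the pool is full.
    std::int32_t allocate(std::int64_t price, std::int32_t quantity) {
        if (free_head_ == kNone) return kNone;
        const std::int32_t i = free_head_;
        free_head_ = slots_[i].next_free;
        slots_[i] = Order{price, quantity, kNone};
        return i;
    }

    // Push a slot back onto the free list. The caller must not use the
    // index again until allocate() hands it back out.
    void release(std::int32_t i) {
        slots_[i].next_free = free_head_;
        free_head_ = i;
    }

    Order& at(std::int32_t i) { return slots_[i]; }

private:
    std::vector<Order> slots_;
    std::int32_t free_head_;
};
```

The same shape carries over to C# with a struct array and integer indices: because the elements are value types stored inline, the collector sees one large array rather than many small objects to trace.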

So I'd probably just make two observations:

· Accurate tracing garbage collection is extremely useful in general and is bundled with C# and prohibitively difficult with C++.

· Memory management bit tricks (e.g. smuggling bits in pointers) are sometimes possible in C++ but prohibited in C#.
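To illustrate the second observation, here is a minimal sketch of smuggling a small tag into the low bits of a pointer. Cell, TaggedPtr and kTagMask are hypothetical names. The sketch assumes the pointee is aligned to at least 8 bytes, so the low three bits of its address are zero, and that the platform provides std::uintptr_t with the usual round-trip behaviour; the C++ standard leaves parts of this implementation-defined.

```cpp
#include <cassert>
#include <cstdint>

// Low three bits are free when the pointee is 8-byte aligned.
constexpr std::uintptr_t kTagMask = 0x7;

// Hypothetical pointee type, forced to 8-byte alignment for the example.
struct alignas(8) Cell {
    int value;
};

class TaggedPtr {
public:
    TaggedPtr(Cell* p, unsigned tag)
        : bits_(reinterpret_cast<std::uintptr_t>(p) | (tag & kTagMask)) {
        // The trick only works if the address really has zero low bits.
        assert((reinterpret_cast<std::uintptr_t>(p) & kTagMask) == 0);
    }

    // Mask the tag off to recover the original pointer.
    Cell* pointer() const {
        return reinterpret_cast<Cell*>(bits_ & ~kTagMask);
    }

    // Extract the three smuggled bits.
    unsigned tag() const {
        return static_cast<unsigned>(bits_ & kTagMask);
    }

private:
    std::uintptr_t bits_;  // pointer and tag packed into one word
};
```

An accurate tracing collector has to recognize and update every reference it manages, which is why a disguised pointer like this cannot be used for managed object references.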

So there is no easy way to compare C++ and C# fairly in this context.

Moreover, memory management is arguably not the biggest performance concern anyway. Many other issues can have a significant effect, such as the quality of generated code on obscure architectures (where C compilers are usually much more mature) vs JIT compiling for a specific CPU; vectorization such as SIMD (.NET does little); JIT-compiled run-time-generated code (like regular expressions in .NET) vs an interpreter in C++; and compilation to GPUs or FPGAs.

I think the only piece of good advice I can give you here is: do your own research and don't listen to the unwashed masses.