Friday, 27 March 2015

Functional programming and databases

Eric Lippert made some interesting statements about the disadvantages of functional programming on Stack Overflow:

“When Jane Smith in accounting gets married and changes her name to Jane Jones, the database backing the business process that prints her paycheque had better be all about handling that sort of mutation. When you fire the machine gun at the alien, most people do not mentally model that as the construction of a new alien with fewer hit points; they model that as a mutation of an existing alien's properties… My point is that if you have an object that represents an employee, it makes sense to think of the operation "change the name of the employee" to be a mutation of the object representing the employee that does not change object identity. When Jane Smith changes her name you don't create a different employee called Jane Jones that is otherwise the same. There aren't two employees with two different names. It is natural to model this process as a mutation of an object, not as the construction of a new object.”

I don’t know about aliens, but I have witnessed many horror stories in industry caused by the approach to databases that Eric is advocating.

Eric’s interpretation of the immutable approach as creating a “different employee called Jane Jones that is otherwise the same” is a straw man argument. The immutable approach is actually about representing a function that maps dates or versions onto data. What was Jane’s full name in 2012 and what is it now? Because the data changes only at discrete points in time, this function can be represented by a database table or a Map from date or version to data. This is the essence of the “functional” approach.
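
For example, here is a minimal sketch in OCaml of a name history represented as an immutable Map from date to data (the dates and the `jane_history` value are hypothetical):

```ocaml
(* A name history as an immutable map from ISO 8601 date strings to
   data; lexicographic string order then matches chronological order. *)
module DateMap = Map.Make (String)

let jane_history =
  DateMap.empty
  |> DateMap.add "2010-06-01" "Jane Smith"
  |> DateMap.add "2015-03-14" "Jane Jones"

(* The name on a given date is the value of the latest entry at or
   before that date. *)
let name_on date history =
  match DateMap.find_last_opt (fun d -> d <= date) history with
  | Some (_, name) -> name
  | None -> failwith "no record on or before this date"

let () =
  print_endline (name_on "2012-01-01" jane_history);  (* Jane Smith *)
  print_endline (name_on "2016-01-01" jane_history)   (* Jane Jones *)
```

Nothing is overwritten when Jane marries: a new entry is added, and both “what was her name in 2012?” and “what is her name now?” are the same lookup.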

In many settings it is useful or even essential to retain all historical data. In the financial and insurance industries this is usually a legal requirement; everywhere else it can greatly simplify testing and validation. For example, when we wrote stock exchange software it was theoretically possible to compress repeated trades into a single larger trade by overwriting the data in the database, but this was completely prohibited because the regulatory authorities required the ability to see individual trades. When we wrote pension fund calculation software for the insurance industry, the system was legally required to be able to rerun old calculations and obtain exactly the same answer it had given 10 years before. In both industries we found people building databases using the imperative approach that Eric Lippert is advocating and then struggling to recover old data. Their solution was often to mutate the database equivalent of an undo buffer in order to retain the ability to regenerate old data. Suffice it to say, this approach is very error prone.

For most people, most of the time, a more “functional” approach to database updates is preferable. Key any data that you are tempted to mutate by a global version number. When searching, search for the latest version. When updating the database, read the latest version and write with the next version number. You can maintain a separate table mapping dates to version numbers. If you are running a concurrent system, establish a consistent ordering of updates using techniques like Lamport clocks or vector clocks. Then, when you want to examine historical data, you can fetch the appropriate version of the data from the database, obtaining the version from the date if necessary. Performance will be slightly worse because of the extra data, but any operation on historical data is much easier with this approach.
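
Here is a minimal sketch of that scheme in OCaml, using a single-threaded in-memory store to stand in for the database (all names are hypothetical, and the date-to-version table and concurrency control are omitted):

```ocaml
(* A versioned store: rows are never overwritten. Every update writes
   the same key again under the next global version number. *)
module M = Map.Make (struct
  type t = string * int                    (* (key, version) *)
  let compare = compare
end)

type store = { mutable rows : string M.t; mutable version : int }

let create () = { rows = M.empty; version = 0 }

(* Writing bumps the global version and adds a new row; old rows are
   retained for historical queries. *)
let write s key value =
  s.version <- s.version + 1;
  s.rows <- M.add (key, s.version) value s.rows

(* Reading as of version [v] finds the latest row for [key] whose
   version does not exceed [v]. *)
let read_as_of s key v =
  match
    M.find_last_opt (fun (k, ver) -> k < key || (k = key && ver <= v)) s.rows
  with
  | Some ((k, _), value) when k = key -> Some value
  | _ -> None

(* Reading the latest version is just a special case. *)
let read s key = read_as_of s key s.version

let () =
  let s = create () in
  write s "jane.name" "Jane Smith";        (* version 1 *)
  write s "jane.name" "Jane Jones";        (* version 2 *)
  assert (read_as_of s "jane.name" 1 = Some "Jane Smith");
  assert (read s "jane.name" = Some "Jane Jones")
```

The point of the design is that the “latest” read and the historical read are the same query with different version arguments, so supporting historical data costs no extra code paths.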

I am currently struggling to do some analytics for a market research company. They have a comprehensive relational database of people, companies, products and services. Each person works for a company. If a person changes jobs, their old job is overwritten with their new job. If a person’s address changes, their old address is overwritten by their new address. So our computers once knew where people used to work, but that information is not readily available to me precisely because the data was overwritten. Consequently, I cannot reconstruct the social networks people belong to or estimate how their work cross-pollinates between companies as people move between them. I cannot even tell someone where the product they ordered 6 months ago was sent, because their address has changed since then. So even where historical data is not legally required, it would still have been very useful!