As developers we make decisions every day about what is the best way to do something. But the definition of best is a really hard one to put your finger on. But you should, and do so at the start of every project and every fork in the road. If you don't detailed technology decision will go unanswered and by default you'll end up in trouble.
If, for example, a development team chooses a programming language for a new project, they typically have a number of considerations they take into account. Such as dynamic or static, functional or imperative, strongly or loosely typed, popularity, community and performance. These considerations get traded off against each other based on what’s the best fit for the team and the problem.
As a case in point some teams would still choose Ruby for their project even though its commonly known that Ruby is not the fastest greyhound in the kennel. Many of the benchmarks agree (http://benchmarksgame.alioth.debian.org/, https://attractivechaos.github.io/plb/) consider it to be "slow". So why do so many companies use it? The answer is two reasons. Firstly, its an amazingly expressive, productive, powerful, and dynamic language. And secondly, it’s fast enough!
Expressive outweighs speed
For most real world situations, most languages, including Ruby, will perform just fine. Of course, if you have millions and billions of users and/or gigas and petas of data - it might not be greyhound to place your bet on. But if thats true no language is really going to save you, its how you put it all together that would count the most.
In most situations teams usually fall on the right side and properly consider the tradeoffs between correctness, productivity, testability etc., over the thing with the highest performance, just because.
Except in one situation
The one place I have found where teams don't seem to consider the tradeoffs as well is when aggregating data. In this case the instinct is to write a database query that mashes all the data together and aggregates the results. These aggregating queries quickly become uber monster queries (UMQs) containing hundreds of lines of SQL and nested down to dozens of conditional branches.
UMQs also rarely have good test coverage (because testing SQL is hard) which is ironic because these queries are often the logic behind critical roll-up reports from which businesses will make their key decisions.
All our good coding practices just went out the window.
Why isn’t the same pattern of decision making followed when it comes to data aggregation? Is it because database queries are faster? Is it because databases are good at aggregation? I think we have all been drinking the database kool-aid for too long. Although these points may be true, let me ask another question: Is aggregating your data in code fast enough and have you tested it?
Recently, I worked on a system that had a lot of stored procedures. Whether they were designed originally for speed, is not clear, but what was is that they were no longer performing. One report was taking several minutes to return results for a large data set that resulted in the user getting a blank report as the http request had timed out. I was tasked with speeding up this report and ensuring that it worked for our largest clients.
UMQ - An uber monster query
After looking at the stored procedures behind the report it was clear it was a UMQ, inefficient, and a rewrite of some kind was necessary.
I could have gone two ways. Re-factor it to be more performant, or move it to code. Given that there were no tests around it, I didn't have a safety net, so decided to move it into code to see if I could make it both testable and maintainable and at the same time achieve the performance requirement.
The approach of moving the core of the report aggregation into the application code is a relatively simple one. First execute a number of simple queries, then piece together the data sets within your code building up a domain model. Then finally the domain model can perform operations on its data to do the desired calculations.
To prove that this approach was viable I needed to first check that it was possible to retrieve the raw data within the desired performance envelope. This involved running each query separately in a query tool and checking it returned results fast enough. One thing to keep in mind is you only want to return columns that you need. If you’re going to be returning 100k rows, having additional columns will affect the performance.
The next step is to verify that pulling the data into your application can also happen in a timely manner. I found that my queries ran fast enough in the query tool, but the same was not true in the application where the time blew out to 20 times slower resulting in not being within the performance window. After some investigation it was found that converting a SQL date to a java.sql.Date was incredibly slow and the cause of the issue. I resolved this issue by returning the date from the query as an integer then transforming into a java.util.Date in the getter for the attribute.
Now that the application has all the raw data, it now needs to be arranged into a usable form. Typically a class is used to orchestrate getting all the data that would then either pass the datasets into the domain model constructor or use a builder to construct the domain model. Ultimately the domain model will be the piece that will contain the calculation business logic. It will perform these calculations on the data that it encapsulates.
Part of constructing the domain model is turning the raw data into a more usable data structures for performing calculations. The size of the data and the types of calculations required will dictate what this will look like. E.g. you may need to put the data into a number of hash maps to allow quick access to groups of data.
You need to be very careful of n+1 issues (http://www.infoq.com/articles/N-Plus-1), so avoiding iterating through data more than necessary and use data structures that allow fast record access.
Now all that’s left is to write the calculation methods that use the encapsulated data. Something about writing these methods just feels right. It may be due to having the ability to easily test and implement the business rules and that feeling of safety of having tight tests around your business calculations. You don’t really get this from writing and testing a query, as it never has that level of coverage or transparency as to what’s actually being tested.
There was massive gain for moving these calculations into the code. Not only did the code perform better, the code was fully tested and was much more readable, maintainable and scalable. Because of the tests around this functionality it was now very clear what the expectations were and how values were calculated.
I believe it’s important to apply the same decision making process to all aspects of application development. Don’t make assumptions about things being too slow until you actually test that it’s true. Processing 100k records in code is fine provided you’re careful about how you do it.