Tuesday 4 March 2014

The Importance of a Good Shard Key

In MongoDB (and many other database solutions) scalability is achieved by dividing your database into a number of smaller portions. Each portion is stored on a separate set of database nodes. The theory is that this approach allows your database to scale horizontally beyond the capacity of a single node and allows load to be spread more evenly across multiple clusters of nodes. In MongoDB this approach is known as sharding (many other databases call a similar concept partitioning).

When you deploy shards in MongoDB you have to select a field (or fields) that are present in each document as the shard key. The value of this field determines which set of database nodes holds a specific document. If you get this key selection wrong then the impacts on your system can be huge.

As an example of how shard key selection is vital, consider this real-world, non-IT example, which clearly highlights how a poorly selected shard key can massively impact performance of a system:

At the weekend I participated in a large Half Marathon event, with about 20,000 other runners. The number of bags to be stored while the race was on was too large for a single baggage tent, so the organisers had wisely decided to introduce a sharded approach by having two tents, each holding half the bags. Additionally, each tent was further sub-divided into separate evenly-sized sharded collections of bags, each managed by its own team of helpers.

Now, as a shard key the organisers had decided to use race number: a seemingly sensible choice given that this would be the single unique piece of information that every runner would have. The bag tents were therefore arranged so that tent 1 was for numbers 1 to 10,000 and tent 2 was for numbers 10,001 to 20,000. The sub divisions inside the tents were further broken down into 1000 number blocks (i.e. 1 to 1000, 1001 to 2000 and so on).

For pre-race bag drop off this sharding approach worked really well (the write scenario). Runners dropping off bags were nicely distributed across both time and the full range of values, so each tent and sub-division could work in parallel for the maximum write performance. This was clearly an excellent shard key selection for the write case.

The problem occurred at the end of the race when runners returned to the tents to collect their bags (the read scenario). Now, the organisers of the race had issued the numbers based on predicted finish time (1 for the fastest predicted finisher, 20,000 for the slowest): the result being that runners finished roughly in number order.

So, what happened then is that there was a initially massive read queue at tent 1 for the 1 to 2000 numbers, while the other shard nodes were almost completely idle. Then the queue moved to the 2001 to 4000 shards, and so on. Towards the end of the race, tent 1 was idle while tent 2 now had the read queues as the later numbered runners all finished.

We have a perfect case of a shard key that seems quite sensible by design but is actually fundamentally broken by the usage scenario of the system, in this case the (near) sequential arrival of numbers for retrieving the bags.

So, how can we solve this? Issuing the race numbers in a different order is one option, but a sequential numbering system based on predicted finish time is quite sensible for many other race organisation requirements. A better option is to change the shard key used by the baggage tents.

Basing the shard key around race number is still a good approach as this is the one piece of information guaranteed to be unique for each runner and the range of possible values is clearly defined and well distributed. Bag drop off (write performance) is never going to be a problem as runners will arrive fairly well distributed across both time and the race number range.

The challenge is coming up with a way of better distributing the bag retrieval (read performance). Fortunately with a sequential numbering system this is pretty easy. Rather than shard by the whole number, just shard by the last digit of the number: tent 1 – numbers ending 0 to 4; tent 2 – numbers ending 5 to 9. Then in each tent further divide into five separate shards, one for each ending digit. If finer granularity is required then within each shard it should be easy enough to sort and index by full race number.

Selecting a good shard key is hard. Considering both the write and read scenarios, and how the shard key will be utilised within these scenarios is critical. It can make a huge difference between a well-functioning system and one that fails to gain the benefits of the sharded approach.