By Suvorov Nikolai (nmsuvorov@edu.hse.ru)
Any application must meet various requirements to be useful and valuable. Functional requirements define what an application should do, for example allowing data to be stored, modified, and retrieved, while nonfunctional requirements define general properties of the application, such as security, reliability, maintainability, scalability, and availability. Most functional requirements can be satisfied by practically any of the storage systems in use today, whereas nonfunctional requirements can only be satisfied by choosing the right storage systems for the application.
According to [1], the main nonfunctional requirements that influence the choice of data storage are reliability, scalability, and maintainability.
Every database type has its own area of appropriate use and its own reasons to be chosen. Distributed databases in particular are used mainly for the sake of scalability, fault tolerance (high availability), and lower latency for geographically distributed users [1].
Such distributed systems mainly implement shared-nothing architectures, described in detail in [2], and are usually based on the replication and sharding (partitioning) approaches. The following sections describe these approaches and discuss the benefits and potential risks of using them.
Replication is an approach in which copies of the same data are kept on multiple machines. It is mainly used to achieve high availability, scalability, and lower latency, and it can even allow an application to keep working during a network interruption [1]. This section describes three main methods implementing this approach and shows their advantages and disadvantages.
The most popular approach, according to [1], is single-leader replication: only one node accepts write operations, while all nodes can serve read operations (which significantly increases read performance compared with a single-machine storage). When a new write arrives, it is first executed by the leader node, which then sends (synchronously or asynchronously) requests to the follower nodes to update the affected data. With asynchronous replication, reads from followers may be stale because it takes time for changes to reach all followers; this is only a temporary state, and the effect is known as eventual consistency [3].
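To make this concrete, here is a minimal Python sketch of single-leader replication with asynchronous propagation (the Leader and Follower classes and their methods are purely illustrative, not the API of any real database): the leader applies a write locally and forwards it to followers in the background, so a read from a follower may briefly return stale data before becoming eventually consistent.

```python
import threading
import time


class Follower:
    """A read-only replica that applies changes sent by the leader."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader:
    """The single node that accepts writes and replicates them asynchronously."""

    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value                           # 1. apply the write locally
        for follower in self.followers:                  # 2. propagate to followers in background threads
            threading.Thread(target=follower.apply, args=(key, value)).start()


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "Alice")
print(followers[0].read("user:1"))   # may still print None: the follower is temporarily stale
time.sleep(0.1)
print(followers[0].read("user:1"))   # 'Alice' once replication has caught up (eventual consistency)
```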
A more robust approach (in terms of tolerating faulty nodes and latency spikes) is multi-leader replication: a client can send write requests to any of several leader nodes, which then propagate the updates to each other and to the follower nodes. This method increases write throughput compared with single-leader replication (the load on each leader decreases, and write requests can be handled by a local datacenter closer to the client), but it has one serious drawback: write conflicts can occur if the same data is concurrently modified on two different leader nodes, and such conflicts must somehow be resolved (most implementations of this type of replication handle conflicts poorly, so the main approach for dealing with conflicts is simply to avoid them [4]).
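One widely known resolution technique is last write wins (LWW): each write carries a timestamp, and when versions collide, every replica keeps the one with the highest timestamp. The sketch below (the function and field names are my own illustration) shows why this converges but is lossy: one of the concurrent writes is silently discarded, which is exactly why avoiding conflicts is usually preferred.

```python
def merge_last_write_wins(local, incoming):
    """Resolve a conflict between two versions of the same key by keeping the
    one with the larger timestamp; ties are broken by the value itself so that
    all leaders converge to the same result regardless of arrival order."""
    return max(local, incoming, key=lambda version: (version["ts"], version["value"]))


# Two leaders accepted concurrent writes to the same key:
version_a = {"value": "Alice", "ts": 1700000005}
version_b = {"value": "Alicia", "ts": 1700000007}

print(merge_last_write_wins(version_a, version_b))  # {'value': 'Alicia', 'ts': 1700000007}; 'Alice' is lost
```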
Another robust approach is leaderless replication: there are no leader and follower nodes, and clients send each write to several nodes and read from several nodes in parallel [1]. A read or write is considered successful only when acknowledged by a quorum of nodes [5]; quorum reads also enable the read repair mechanism, in which any replica found to hold a stale version is updated with the newest value observed during the read. However, conflicts can still arise from concurrent writes as well as during read repair; some techniques to resolve them are mentioned in [1], but it is still recommended to avoid such conflicts where possible.
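The quorum condition itself is simple: with n replicas, every write must be acknowledged by at least w nodes and every read must query at least r nodes, and as long as w + r > n the two sets overlap, so a read sees at least one up-to-date copy [1]. The following sketch (the data layout and function names are illustrative assumptions) checks this condition and demonstrates read repair, using per-key timestamps as a stand-in for real version numbers.

```python
def quorum_overlaps(n, w, r):
    """A read quorum is guaranteed to intersect a write quorum when w + r > n,
    so at least one of the r contacted replicas holds the latest acknowledged write."""
    return w + r > n


def quorum_read(replicas, key, r):
    """Read from r replicas, return the newest version, and repair stale copies."""
    contacted = replicas[:r]
    versions = [(rep, rep["data"].get(key, {"value": None, "ts": 0})) for rep in contacted]
    newest = max(versions, key=lambda item: item[1]["ts"])[1]
    for rep, version in versions:           # read repair: push the newest version
        if version["ts"] < newest["ts"]:    # back to any contacted replica that is behind
            rep["data"][key] = newest
    return newest["value"]


replicas = [
    {"data": {"user:1": {"value": "Alice", "ts": 2}}},
    {"data": {"user:1": {"value": "Al", "ts": 1}}},   # stale replica
    {"data": {}},
]
print(quorum_overlaps(3, 2, 2))            # True: n=3, w=2, r=2 guarantees overlap
print(quorum_read(replicas, "user:1", 2))  # 'Alice'; the stale replica is repaired on the way
```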
Although making a copy of data seems like an easy task, many aspects have to be considered, including concurrent writes (mentioned above), read-after-write consistency [6], monotonic reads [7], consistent prefix reads [7], and dealing with unavailable nodes and network interruptions. Nevertheless, replication goes a long way toward making a data storage robust and able to handle high volumes of reads and writes. It also allows data to be placed geographically close to clients, so that users can interact with it faster.
The main goal of sharding (partitioning) is to spread the data and the query load evenly across multiple machines [1]. A large dataset can thus be distributed across many disks, and the query load across many processors. Despite these benefits, sharding requires developers to choose the partitioning scheme carefully, as it directly influences overall performance. In addition, rebalancing techniques must be considered for when new nodes are added or existing nodes are removed. It is also worth mentioning that sharding is usually combined with replication, so that copies of each partition are stored on multiple nodes for fault tolerance.
When sharding is used, the first thing to decide is how to partition the data. According to [1], the two main methods are partitioning by key range, which keeps adjacent keys together and supports efficient range scans but can create hot spots, and partitioning by hash of key, which spreads the load more evenly but destroys key ordering.
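As a small illustration of the second method, the Python sketch below hashes each key with a stable hash function and maps it onto a fixed set of partitions (the partition count and the choice of hash are arbitrary assumptions made for the example).

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition by hashing it; a stable hash (rather than Python's
    process-randomized hash()) is used so that every node computes the same mapping."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


NUM_PARTITIONS = 8
for key in ("user:1", "user:2", "order:42"):
    print(key, "-> partition", partition_for(key, NUM_PARTITIONS))
```

Note that the keys are mapped to a fixed set of partitions rather than directly to nodes; mapping keys straight to nodes with a simple modulo would force most keys to move whenever a node is added or removed, which is why the rebalancing strategies discussed next keep partition-to-node assignment as a separate step [1].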
As can be seen, the choice of technique depends heavily on the data to be stored and on the queries that will be executed against it. Once the storage is running, new nodes may be added and existing nodes removed, which is why rebalancing techniques must also be considered. According to [1], three main methods are currently used for this purpose: creating a fixed number of partitions (many more partitions than nodes, so that whole partitions can be handed over to new nodes), dynamic partitioning (splitting and merging partitions as they grow and shrink), and partitioning proportionally to the number of nodes. A sketch of the first method follows below.
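The sketch (a rough approximation with illustrative names, not an algorithm taken from any particular system) assigns a fixed set of partitions to nodes and, when a node joins, moves only as many whole partitions as needed to even out the load, leaving everything else in place.

```python
def assign_partitions(partitions, nodes):
    """Spread a fixed set of partitions across the initial nodes round-robin."""
    return {p: nodes[i % len(nodes)] for i, p in enumerate(partitions)}


def rebalance(assignment, nodes):
    """Recompute the partition-to-node mapping for a new node set, moving a
    partition only if its node is gone or noticeably more loaded than the
    least-loaded node; the rest of the data stays where it is."""
    counts = {n: 0 for n in nodes}
    for node in assignment.values():
        if node in counts:
            counts[node] += 1
    new_assignment = {}
    for partition, node in assignment.items():
        if node in counts and counts[node] <= min(counts.values()) + 1:
            new_assignment[partition] = node            # keep the partition in place
        else:
            target = min(counts, key=counts.get)        # hand it over to the emptiest node
            if node in counts:
                counts[node] -= 1
            counts[target] += 1
            new_assignment[partition] = target
    return new_assignment


partitions = [f"p{i}" for i in range(8)]
assignment = assign_partitions(partitions, ["node-a", "node-b"])
rebalanced = rebalance(assignment, ["node-a", "node-b", "node-c"])
moved = [p for p in partitions if assignment[p] != rebalanced[p]]
print(moved)  # only a couple of partitions move to node-c; the other six stay put
```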
The usage of sharding is especially beneficial when the application involves so much data that storing and processing it on a single machine becomes unfeasible. Although it can successfully spread both the data and the query load (and even parallelize complex queries across multiple nodes), various aspects must be considered before implementing it, including an appropriate data schema, the rebalancing approach, request routing [1], etc. Nevertheless, sharding is currently the main approach to dealing with amounts of data that cannot be stored on a single machine, and when used correctly, its advantages significantly outweigh its disadvantages, especially when it is combined with replication, which substantially increases the robustness and reliability of the data storage.
Despite all the benefits provided by replication and sharding, in my view it is still better to use them only when the nonfunctional requirements explicitly call for scalability, fault tolerance, and low latency. Using these technologies without a real need only incurs extra expense and may even decrease overall performance. In addition, dozens of aspects must be carefully considered to make them beneficial for a given application. Nevertheless, when applied appropriately, they can make the application more reliable (thanks to multiple replicas of the same data) and faster (thanks to spreading the query load and the possibility of reduced latency). Such distributed architectures are becoming feasible even for small companies [1] and may become much more widespread in the future.
But when should developers decide whether or not to adopt these technologies? The Agile methodology, which continues to grow in popularity, offers an answer: when a specific requirement for them actually arises. Over successive sprints, developers acquire more and more knowledge of the domain and of the customer's requirements, and it is this knowledge, together with continuous interaction with the customer, that will at some point answer the question. In any case, only sufficient knowledge of the domain makes it possible to build a distributed architecture that brings real benefits.