What is Database Sharding?
Your application is growing. It has more active users and more features, and it generates more data every day. Your database is now becoming a bottleneck for the rest of your application. Database sharding could be the solution to your problems, but many people do not have a clear understanding of what it is and, especially, when to use it. In this article, we’ll cover the basics of database sharding, its best use cases, and the different ways you can implement it.
What is Database Sharding?
Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows for larger datasets to be split into smaller chunks and stored in multiple data nodes, increasing the total storage capacity of the system.
Similarly, by distributing the data across multiple machines, a sharded database can handle more requests than a single machine can.
Sharding is a form of scaling known as horizontal scaling or scale-out, as additional nodes are brought on to share the load. Horizontal scaling allows for near-limitless scalability to handle big data and intense workloads. In contrast, vertical scaling refers to increasing the power of a single machine or single server through a more powerful CPU, increased RAM, or increased storage capacity.
How does Database Sharding work?
Database sharding divides the entire dataset into multiple groups known as shards. Once divided, each shard can be stored independently, usually on multiple servers, which are often referred to as a cluster. Each shard can be accessed independently, which means you can access data faster, and have more resources available for processing, computing, and storage.
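As a rough illustration of this division, the sketch below splits a small in-memory dataset into three shards. The `user_id` field, the modulo rule, and the function name `split_into_shards` are assumptions made for the example, not part of any particular database.

```python
from collections import defaultdict


def split_into_shards(rows, num_shards, key="user_id"):
    """Group rows into shards by taking the integer key modulo num_shards."""
    shards = defaultdict(list)
    for row in rows:
        shards[row[key] % num_shards].append(row)
    return dict(shards)


# Ten hypothetical user rows divided across three shards.
rows = [{"user_id": i, "name": f"user{i}"} for i in range(10)]
shards = split_into_shards(rows, num_shards=3)

for shard_id in sorted(shards):
    print(shard_id, [row["user_id"] for row in shards[shard_id]])
```

In a real cluster each of these groups would live on its own server, so a request for one user only needs to touch the shard that holds that user.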
Why is Database Sharding important?
As an application grows, the number of application users and the amount of data it stores increase over time. The database becomes a bottleneck if the data volume becomes too large and too many users attempt to use the application to read or save information simultaneously. The application slows down and affects the customer experience. Database sharding is one of the methods to solve this problem because it enables parallel processing of smaller datasets across shards.
Advantages of sharding
Sharding allows you to scale your database to handle the increased load to a nearly unlimited degree by providing increased read/write throughput, storage capacity, and high availability. Let’s look at each of those in a little more detail.
- Increased read/write throughput: Distributing the dataset across multiple shards increases both read and write capacity, as long as each read or write operation is confined to a single shard.
- Increased storage capacity: Similarly, adding shards increases the total storage capacity of the system, so storage can keep growing as nodes are added.
- High availability: Finally, sharding supports high availability in two ways. First, when each shard is deployed as a replica set, every piece of data is replicated. Second, because the data is distributed, the database as a whole remains partially functional even if an entire shard becomes unavailable: the data held on the other shards can still be read and written.
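The partial-availability point can be sketched with a toy in-memory cluster. Everything here (`TinyCluster`, `ShardUnavailable`, the modulo placement rule) is hypothetical and exists only for illustration. The sketch does not model replication, only what happens to requests when one shard is lost.

```python
class ShardUnavailable(Exception):
    """Raised when a request targets a shard that is currently down."""


class TinyCluster:
    """Toy in-memory cluster; each shard is a plain dict (hypothetical model)."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]
        self.down = set()  # IDs of shards that are currently unavailable

    def _shard_for(self, user_id):
        shard_id = user_id % len(self.shards)
        if shard_id in self.down:
            raise ShardUnavailable(f"shard {shard_id} is down")
        return self.shards[shard_id]

    def put(self, user_id, value):
        self._shard_for(user_id)[user_id] = value

    def get(self, user_id):
        return self._shard_for(user_id).get(user_id)


cluster = TinyCluster(num_shards=3)
for user_id in range(6):
    cluster.put(user_id, f"user{user_id}")

cluster.down.add(1)  # simulate losing shard 1, which holds users 1 and 4


def is_readable(user_id):
    """Return True if the user's shard can still serve the read."""
    try:
        cluster.get(user_id)
        return True
    except ShardUnavailable:
        return False


readable = [user_id for user_id in range(6) if is_readable(user_id)]
print(readable)  # users on the surviving shards
```

Users whose data sits on the failed shard get an error, while everyone else is served normally. With replication in place, a replica of the failed shard would take over instead.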
Disadvantages of sharding
Sharding does come with several drawbacks, namely overhead in query result compilation, complexity of administration, and increased infrastructure costs.
- Query overhead: A sharded database needs a routing layer, a separate machine or service that knows how to route each query to the appropriate shard. This introduces additional latency on every operation. Furthermore, if the data required for the query is horizontally partitioned across multiple shards, the router must query each shard and merge the results. This can make an otherwise simple operation quite expensive and slow down response times.
- Complexity of administration: With a single unsharded database, only the database server itself requires upkeep and maintenance. With a sharded database, on top of managing the shards themselves, there are additional service nodes to maintain. Plus, in cases where replication is being used, any data updates must be mirrored across each replicated node. Overall, a sharded database is a more complex system that requires more administration.
- Increased infrastructure costs: Sharding by its nature requires additional machines and compute power over a single database server. While this allows your database to grow beyond the limits of a single machine, each additional shard comes with higher costs. The cost of a distributed database system, especially one that is not properly optimized, can be significant.
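To make the query overhead concrete, here is a minimal sketch that contrasts a single-shard lookup with a cross-shard query the router has to fan out and merge. The shard layout, the `spent` field, and both function names are assumptions for the example.

```python
def find_by_id(shards, user_id):
    """Single-shard read: the router touches exactly one shard."""
    return shards[user_id % len(shards)].get(user_id)


def top_spenders(shards, limit):
    """Cross-shard read: ask every shard for its local top rows, then merge."""
    partials = []
    for shard in shards:
        local_top = sorted(shard.values(), key=lambda r: r["spent"], reverse=True)
        partials.extend(local_top[:limit])
    return sorted(partials, key=lambda r: r["spent"], reverse=True)[:limit]


# Three hypothetical shards, each a dict of user_id -> row.
shards = [{}, {}, {}]
for uid, spent in [(1, 50), (2, 10), (3, 70), (4, 30), (5, 90), (6, 20)]:
    shards[uid % 3][uid] = {"user_id": uid, "spent": spent}

print(find_by_id(shards, 4))    # one shard consulted
print(top_spenders(shards, 2))  # all three shards consulted, results merged
```

The lookup by ID costs one shard access regardless of cluster size, while the ranking query costs one access per shard plus a merge step. That is why queries that cannot be confined to a single shard tend to be the slow ones.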
Next, let's look at the different database sharding architectures. You can choose among several, depending on your requirements:
- Directory-Based Sharding: This architecture uses a lookup table to keep track of which shard holds which data. The lookup table records exactly where each piece of data is stored. This gives you flexibility, since entries can be assigned to shards by ranges of values, by an algorithm, or by any other scheme. The main drawback is that the lookup table must be consulted on every query. The lookup table is also a single point of failure: if it becomes unavailable, the whole system can fail, because queries cannot be routed without it.
- Key-Based Sharding: Also known as hash-based sharding, this architecture relies on hashing. A hash function takes a value from each row (the key) and maps it to the shard where that row is stored. Because the shard location is computed from the data itself, you do not need to maintain a separate record of where each row lives, whereas the other architectures must keep track of which data is in which shard.
- Range-Based Sharding: In this architecture, data is assigned to shards based on ranges of a given value. A lookup table lists each range of values together with the ID of the shard that holds it, and a row is stored on the shard whose range contains its value. For example, rows could be split across shards by price range. This can lead to an unbalanced spread of data, because some ranges may receive far more rows than others.
- Geo-Based Sharding: This architecture is similar to range-based sharding, except that the shard is chosen by the user's location or region. For example, the Tinder application uses this type of sharding.
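The four architectures differ mainly in how a row's shard is chosen. The sketch below shows one possible routing function for each. The shard names, tenant IDs, price boundaries, and region codes are all made up for the example, and a production system would usually delegate this logic to the database or a routing service.

```python
import bisect
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical shard names

# Directory-based: an explicit lookup table maps each key to a shard.
DIRECTORY = {"tenant-1": "shard-a", "tenant-2": "shard-c"}


def directory_shard(tenant_id):
    return DIRECTORY[tenant_id]


# Key-based (hash-based): the shard is computed from a hash of the key.
def hash_shard(key):
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# Range-based: price < 100 -> shard-a, 100 <= price < 1000 -> shard-b,
# price >= 1000 -> shard-c (assumed boundaries).
RANGE_UPPER_BOUNDS = [100, 1000]


def range_shard(price):
    return SHARDS[bisect.bisect_right(RANGE_UPPER_BOUNDS, price)]


# Geo-based: the user's region selects the shard.
REGION_TO_SHARD = {"eu": "shard-a", "us": "shard-b", "apac": "shard-c"}


def geo_shard(region):
    return REGION_TO_SHARD[region]


print(directory_shard("tenant-2"))
print(hash_shard("alice"))
print(range_shard(250))
print(geo_shard("eu"))
```

Note that the hash-based function needs no stored mapping at all, while the other three depend on a table (`DIRECTORY`, `RANGE_UPPER_BOUNDS`, `REGION_TO_SHARD`) that has to be kept available and up to date.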
When should you consider Database Sharding?
Sharding can be used in multiple cases, such as:
- If the amount of your application data is growing rapidly and you run out of capacity on a single server, sharding can help you distribute the load across multiple servers and increase capacity.
- If the number of reads or writes to your database exceeds what a single node or its read replicas can handle, you will observe slowed response times or timeouts. Sharding can help you distribute the load and improve performance by allowing each shard to be optimized for specific queries or workloads.
- Similarly, you can also face slowed response times or timeouts when the network bandwidth needed by the application surpasses the bandwidth available to a single database node and any read replicas.
Sharding can be a great solution for those looking to scale their database horizontally. However, it also adds a great deal of complexity and creates more potential failure points for your application. Sharding may be necessary for some, but the time and resources needed to create and maintain a sharded architecture could outweigh the benefits for others.
Having read this conceptual article, you should have a clearer understanding of the pros and cons of sharding. Moving forward, you can use this insight to make a more informed decision about whether or not a sharded database architecture is right for your application.