Database Sharding Basic Idea and Sharding Strategies

This article focuses on the basic idea and theory sharding sharding strategy.

The Basic Idea

The basic idea is to make a database Sharding cut into multiple parts into different databases (server), thereby alleviate performance problems single database. Is not strictly speaking, mass data for the database, if it is because the data table and more and more, this time for using a vertical sharding, that close relationship (such as the same module) sharding table out on a server. If the table is not much, but the data of each table is very large, this time for the level of sharding, ie the data table by certain rules (such as by ID hash) cut into multiple databases (server) on. Of course, in reality, these two cases is more mixed together, this time the need to make a selection according to the actual situation, may also be integrated with the vertical and horizontal sharding, which will cut into the existing database, as a matrix-like expanded infinitely database (server) array. The following describes in detail what are the vertical and horizontal sharding sharding.

The maximum vertical sharding feature is a simple rule, the implementation is also more convenient, especially for the coupling between the business of non-
Often low, the interaction is very small, the business logic is very clear system. In such a system, can be done easily with different industry
Table Service module used spin to a different database. Depending on the split table, also on the application
Smaller, splitting rules will be relatively simple and clear. (This is the so-called "share nothing").

                                                              
                                                              
  +----------------------------------------------------------+
  |                   Application Servers                    |
  +---+------------------------------------------------+-----+
      |                                                |
      |            +------------------------+          |
      |            |       Single DB        |          |
      |            +------------+-----------+          |
      |                         |                      |
      |                         |                      |
      |                Vertical | Sharding             |
      |                         |                      |
      v <-----------------------+--------------------> v
  +---------+ +----------------+ +--------------+ +----------+
  |  Users  | | Group Messages | | Photo Albums | |  Events  |
  +---------+ +----------------+ +--------------+ +----------+
                                                              
                                                              

Level of sharding compared to the vertical sharding, relatively speaking, a little more complex. Because you want a table with different data split
Assigned to different databases, for applications, compared to split itself in accordance with the rules of the table name to split more complicated, after
Also some of the more complex data maintenance.

                                                              
                                                              
  +----------------------------------------------------------+
  |                   Application Servers                    |
  +-----+-------------------------------------------------+--+
        |                          +----------------+     |
        |                          | user id & n=0  |     |
        |                          +----------------+     |
        v                          | user id & n=1  |     |
   +-----------+                   +----------------+     |
   |           |     Horizontal    | user id & n=2  |     |
   |           |      Sharding     +----------------+     |
   | Single DB +------------------>| user id & n=3  |<----+
   |           |                   +----------------+
   |           |                   | user id & n=4  |
   +-----------+                   +----------------+
                                   |       ...      |
                                   +----------------+
                                                              
                                                              

Let us consider the general case of data sharding: on the one hand, all the tables in a database is usually not possible by a whole series up a table, the implied meaning of this sentence is that almost all level sharding for a small rub a little twist (in fact, points out vertical shear block) close relationship between tables, and can not be carried out for all tables. On the other hand, some very high system load, even if just a single table are unable to bear the load through a single database host, which means that just can not be completely vertical sharding solved and ask. Therefore, most systems will be vertical and horizontal sharding sharding used in combination, the first vertical sharding of the system to do, and then do selective level sharding for the case every little twist of the table. So that the entire database is cut into a distributed matrix.

                                                              
                                                              
 +--------------------------------------------------------------------+
 |                        Application Servers                         |
 +---------+------------------------------------------------+---------+
           |                                                |
           |            +------------------------+          |
           |            |       Single DB        |          |
           |            +------------+-----------+          |
           |                         |                      |
           |                         |                      |
           |                Vertical | Sharding             |
           |                         |                      |
           v <-----------------------+--------------------> v
 +---------------+ +----------------+ +----------------+ +------------+
 |     Users     | | Group Messages | |  Photo Albums  | |   Events   |
 +-------+-------+ +------+---------+ +--------+-------+ +------+-----+
         |                |                    |                |
         |                | Horizontal Sharding|                |
         v                v                    v                v
 +---------------+ +----------------+ +----------------+ +------------+
 | user id & n=0 | | group id & n=0 | | album id & n=0 | |   time<0   |
 +---------------+ +----------------+ +----------------+ +------------+
 | user id & n=1 | | group id & n=1 | | album id & n=1 | | n<time<2n  |
 +---------------+ +----------------+ +----------------+ +------------+
 | user id & n=2 | | group id & n=2 | | album id & n=2 | | 2n<time<3n |
 +---------------+ +----------------+ +----------------+ +------------+
 | user id & n=3 | | group id & n=3 | | album id & n=3 | | 3n<time<4n |
 +---------------+ +----------------+ +----------------+ +------------+
 |      ...      | |       ...      | |       ...      | |     ...    |
 +---------------+ +----------------+ +----------------+ +------------+
                                                                      
                                                                      

Sharding Strategy

As previously mentioned, the first step of sharding is based on the level of the vertical slit and then carried sharding. Results exactly vertical to horizontal sharding sharding and prepared the way. The idea is vertical sharding analysis aggregation relationship between tables, the close relationship between the tables together. In most cases may be the same module, or the same "aggregation." Here the "gathering" is the domain-driven design in the said gathering. In the vertical separation of the table cut gathered to identify the "root element" (where "root element" is the domain-driven design in the "aggregate root"), press the "root element" horizontal sharding, which is from the "root elements "began, all the data and its direct and indirect association into a shard inside. Thus the possibility of cross-shard associated appear very small. Applications do not have to interrupt the association between both tables. For example: For social networking sites, almost all of the data will eventually be associated to a user, based on user sharding is the best choice. Another example is the forum systems, users and forum two modules should be sharing in the vertical cut shard was in the two years, the Forum module for it, Forum apparently aggregate root, so the horizontal sharding by Forum, the Forum where all posts and replies are placed in a shard with the Forum where it is natural.

For shared data in the data, if it is read-only dictionary tables, each shard in the maintenance of a supposed to be a good choice, so do not have to interrupt the association. If you are associated with general data across nodes between, we must break.

Of particular note are: When both vertical and horizontal cut-sharing, sharding strategy will be some subtle changes. For example: In considering only the vertical sharding of the time, is divided into an arbitrary relationship can be maintained between the tables together, so you can press the "functional modules" partition table, but once introduced level sharding, relationship between tables on will be greatly restricted, usually only allow the primary table (table ID in the hash table), and to preserve the association between the plurality of sub-tables thereof, that is: when both vertical and horizontal shear sharing in the vertical direction sharding will no longer be a "functional modules" to be divided, but the need for more fine-grained vertical sharding, and this granularity and domain-driven design in the "convergence" concept coincide, even said to be exactly the same, each shard in the primary table is an aggregate root aggregation! Such sharding points down you will find database is cut too distracted subdivision (number shard will be more, but not many shard in the table), in order to avoid excessive data source management, make full use of each database server resources that can be considered similar to the business, and with similar data growth rate (the amount of data in the main table on the same order of magnitude) of two or more of the shard into the same data source, each shard is still independent, they have respective main table, and the independent use of each hash table ID, except that their hash modulo (i.e., number of nodes) required is the same.

Affairs Issues

There are two problems to solve matters feasible solution: distributed transactions and through the application and database transactions together to achieve the following two sets of programs to control a simple comparison.

Using Distributed Transaction

  • Advantages: by database management, simple and effective.
  • Disadvantages: high cost performance, especially a growing number shard.

By jointly controlled applications and databases

  • Principle: a distributed transaction across multiple databases at only split into multiple
    In a single database above a small business, and through the application to the total control
    Various small business.
  • Advantages: There are advantages in performance.
  • Disadvantages: Requires an application designed to make the transaction control flexibility. If you use the spring of transaction management, change it will face some difficulties.

Cross-node Join issues

As long as they were slicing across nodes Join problem is inevitable. But good design and slicing it can reduce the occurrence of such situations. To solve this problem is common practice to achieve twice the query. In the results of the first set of the query to find related data id, id get the associated data to initiate a second time according to these requests.

Cross-node count, order by, group by and aggregate functions problem
These are a class of problems, because they are based on the entire set of data needs to be calculated. Most agents are not automatically handle merging. Solution: Similar to solve the problem of inter-node join, get the end result of the merger in the application separately on each node. And join a different query each node can be executed in parallel, so many times it is much faster than the speed of a single large table. However, if the result set is large, the application memory consumption is a problem.

Database Sharding Basic Idea and Sharding Strategies
3 votes, 4.67 avg. rating (93% score)