What is sharding and why is it important?

Question:

I think I understand sharding to be putting your sliced-up data (the shards) back into an easy-to-deal-with aggregate that makes sense in the context. Is this correct?

Update: I guess I am struggling here. In my opinion, the application tier should have no business determining where data should be stored. At best it should be a shard client of some sort. Both responses answered the "what" but not the "why is it important" aspect. What implications does it have outside of the obvious performance gains? Are these gains sufficient to offset the MVC violation? Is sharding mostly important in very large scale applications, or does it apply to smaller scale ones as well?

Sharding is just another name for "horizontal partitioning" of a database. You might want to search for that term to get a clearer picture.

Wikipedia


Horizontal partitioning is a design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location. The advantage is that the number of rows in each table is reduced (this reduces index size, which improves search performance). If the sharding is based on some real-world aspect of the data (e.g. European customers vs. American customers), then it may be possible to infer the appropriate shard membership easily and automatically, and to query only the relevant shard.
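To make the "European vs. American customers" idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the shard names, the country-to-shard mapping, and the in-memory lists standing in for separate database servers. It only shows how shard membership can be inferred from an attribute of the row, so that a lookup touches a single shard.

```python
# Hypothetical layout: one in-memory "table" per shard. In a real deployment
# each entry would be a connection to a separate database server.
SHARDS = {
    "europe": [],
    "america": [],
}

# Assumed mapping from a real-world attribute (country) to a shard name.
COUNTRY_TO_SHARD = {
    "DE": "europe",
    "FR": "europe",
    "US": "america",
    "CA": "america",
}


def shard_for(country):
    """Infer the shard from the customer's country."""
    return COUNTRY_TO_SHARD[country]


def insert_customer(customer):
    """Store the row in exactly one shard, chosen by its country."""
    SHARDS[shard_for(customer["country"])].append(customer)


def find_customers(country):
    """Scan only the relevant shard; the other shard is never read."""
    rows = SHARDS[shard_for(country)]
    return [row for row in rows if row["country"] == country]
```

Because each lookup scans only one of the lists, every search works on fewer rows than a single combined table would hold, which is the advantage the quote describes.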

More information on sharding:


Firstly, each database server is identical, having the same table structure. Secondly, the data records are logically split up in a sharded database. Unlike the partitioned database, each complete data record exists in only one shard (unless there's mirroring for backup/redundancy), with all CRUD operations performed just in that database. You may not like the terminology used, but this does represent a different way of organizing a logical database into smaller parts.
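The two properties in that quote (identical table structure on every shard, and each record living in exactly one shard) can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions: the `users` table, the helper names, and the CRC32-modulo routing rule are all made up here, and separate in-memory databases stand in for separate servers.

```python
import sqlite3
import zlib

# Every shard gets the same schema (assumed table, for illustration only).
SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"


def make_shards(count):
    """Create `count` independent in-memory databases with identical tables."""
    shards = []
    for _ in range(count):
        conn = sqlite3.connect(":memory:")
        conn.execute(SCHEMA)
        shards.append(conn)
    return shards


def shard_for(shards, user_id):
    """Pick the single shard that owns this id (assumed rule: CRC32 modulo)."""
    index = zlib.crc32(str(user_id).encode("utf-8")) % len(shards)
    return shards[index]


def create_user(shards, user_id, name):
    conn = shard_for(shards, user_id)
    with conn:  # commits on success
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))


def read_user(shards, user_id):
    conn = shard_for(shards, user_id)
    row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0] if row else None


def update_user(shards, user_id, name):
    conn = shard_for(shards, user_id)
    with conn:
        conn.execute("UPDATE users SET name = ? WHERE id = ?", (name, user_id))


def delete_user(shards, user_id):
    conn = shard_for(shards, user_id)
    with conn:
        conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
```

All four CRUD helpers go through `shard_for`, so a given record is only ever created, read, changed, or removed in the one database that owns it.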

Update: You won't break MVC. The work of determining the correct shard in which to store the data would be done transparently by your data access layer. There you would have to determine the correct shard based on the criteria you used to shard your database. (You have to shard the database manually into a number of shards, based on some concrete aspects of your application.) Then you have to take care to use the correct shard when loading data from and storing data into the database.
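A rough Python sketch of that separation follows. The class name, the `pick_shard` callback, and the dictionaries standing in for databases are all hypothetical; the point is only that the shard decision sits inside the data access layer, while the application-level function never mentions shards.

```python
class ShardedUserRepository:
    """Data access layer: the only place that knows shards exist."""

    def __init__(self, shard_count, pick_shard):
        # One dict per shard stands in for one database per shard.
        self._shards = [{} for _ in range(shard_count)]
        # `pick_shard` encodes the sharding criterion (assumed: maps id -> int).
        self._pick_shard = pick_shard

    def _shard(self, user_id):
        return self._shards[self._pick_shard(user_id) % len(self._shards)]

    def save(self, user):
        self._shard(user["id"])[user["id"]] = dict(user)

    def load(self, user_id):
        user = self._shard(user_id).get(user_id)
        return dict(user) if user is not None else None


def rename_user(repo, user_id, new_name):
    """Application-level code: it sees load/save only, never a shard."""
    user = repo.load(user_id)
    if user is None:
        return False
    user["name"] = new_name
    repo.save(user)
    return True
```

Usage might look like `repo = ShardedUserRepository(4, lambda user_id: user_id)` followed by `repo.save({"id": 42, "name": "Ann"})`. Changing the sharding criterion then means passing a different `pick_shard`, with no change to `rename_user`.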

Maybe this example with Java code (it's about the Hibernate Shards project) makes it somewhat clearer how this would work in a real-world scenario.

To address the "why sharding": it is mainly for very large scale applications with lots of data. First, it helps minimize response times for database queries. Second, you can host your data on several cheaper, "lower-end" machines instead of one big server, which might no longer suffice.