Cassandra denormalized data model

Question:

I read that in NoSQL stores (Cassandra, for instance) data is often stored denormalized. See, for example, this SO answer or this website.

For example, suppose you have column families for employees and departments and you want to execute the query: select * from Emps where Birthdate = '25/04/1975'. Then you have to create a column family birthday_Emps and store the ID of each employee as a column. You can then query the birthday_Emps family for the key '25/04/1975' and instantly get the IDs of all the employees born on that date. You can even denormalize the employee details into birthday_Emps as well, so that you also instantly have the employee names.

Is this really the way to do it?


  1. Whenever an employee is inserted or deleted, you also have to add the employee to, or remove the employee from, birthday_Emps. In another example someone even said that sometimes a single delete in one table requires hundreds of deletes in other tables. Is this really common?

  2. Is it common to do joins in application code? Is there software that lets you create pre-written applications to join together data from different queries?

  3. Are there best practices, patterns, etc. for handling these data modeling questions?


"Yes," for the most part. Taking a query-based approach to data modeling really is the best way to do it.


  1. It is still a good idea, because the faster queries make it worth it. Yes, there's a little more housecleaning to do. I haven't had to execute hundreds of deletes from other column families, but occasionally there is some complicated clean-up. That said, you shouldn't be doing a whole lot of deleting in Cassandra anyway (it's an anti-pattern, since every delete writes a tombstone).

  2. No. Client-side JOINs are just as bad as distributed JOINs. The whole idea is to create a table that returns the data for each specific query (denormalized and/or replicated), which removes the need for a JOIN at all. The exception is OLAP queries for analysis, where you can use a tool like Apache Spark to execute an ad-hoc, distributed JOIN. But that's definitely not something you'd want to do on a production system.

  3. There are a few articles I can recommend:

  • Getting Started with Cassandra Time Series Data Modeling - Written by DataStax's Chief Evangelist Patrick McFadin, it covers one of the more common Cassandra use cases in a few different ways.
  • Escaping From Disco-Era Data Modeling - This one talks about some of the obstacles that beginners with Cassandra can face, as well as the general approach to take in overcoming them. Disclaimer: I am the author.
  • Cassandra Data Modeling Best Practices, Part 1 - You can't go wrong with Jay Patel's (eBay) classic article on Cassandra modeling practices. It's a little dated in that the examples are grounded in the pre-CQL world, but the techniques still resonate.
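
To make the bookkeeping discussed above a bit more concrete, here is a minimal sketch of the "one table per query" idea. It is not Cassandra driver code: plain Python dictionaries stand in for the Emps and birthday_Emps column families, and the function names and sample employees are made up purely for illustration. The point is only that every write touches both structures, so the birthday lookup is a single-key read with no join.

```python
from collections import defaultdict

# Plain dicts stand in for two Cassandra tables (illustrative only,
# not the real driver API).
emps = {}                          # emp_id -> {"name": ..., "birthdate": ...}
birthday_emps = defaultdict(dict)  # birthdate -> {emp_id: name}


def insert_employee(emp_id, name, birthdate):
    """Write the employee to the base table and to the query table."""
    emps[emp_id] = {"name": name, "birthdate": birthdate}
    # Denormalize the name so the birthday lookup needs no second read.
    birthday_emps[birthdate][emp_id] = name


def delete_employee(emp_id):
    """Remove the employee from both tables (the extra housekeeping)."""
    row = emps.pop(emp_id, None)
    if row is None:
        return
    partition = birthday_emps.get(row["birthdate"], {})
    partition.pop(emp_id, None)
    if not partition:
        # Drop the now-empty partition.
        birthday_emps.pop(row["birthdate"], None)


def employees_born_on(birthdate):
    """Single-key lookup; no join with emps is needed."""
    return dict(birthday_emps.get(birthdate, {}))


# Hypothetical sample data.
insert_employee(1, "Alice", "25/04/1975")
insert_employee(2, "Bob", "25/04/1975")
insert_employee(3, "Carol", "01/01/1980")
delete_employee(2)
print(employees_born_on("25/04/1975"))  # {1: 'Alice'}
```

In a real cluster the two writes would be two statements against two tables, and the application has to keep them in sync; that is the housecleaning cost you trade for fast reads.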