Cassandra denormalization data model

Question:

I read that in NoSQL databases (Cassandra, for instance) data is often stored denormalized. See, for instance, this SO answer or this website.

An example given there: if you have column families of employees and departments and you want to execute a query such as select * from Emps where Birthdate = '25/04/1975', then you have to make a column family birthday_Emps and store the ID of each employee as a column. You can then query the birthday_Emps family for the key '25/04/1975' and instantly get the IDs of all employees born on that date. You can even denormalize the employee details into birthday_Emps as well, so that you also instantly have the employee names.

Is this true?


  1. Whenever an employee is deleted or inserted, you will have to update birthday_Emps too. In another example, someone even said that sometimes one delete in one table requires hundreds of deletes in other tables. Is this really common to do?

  2. Is it common to do joins in application code? Is there software that lets you create pre-written applications to join together data from different queries?

  3. Are there best practices, patterns, etc. for handling these data model questions?


Answer:

"Yes" for the most part. Taking a query-based approach to data modeling really is the best way to do it.

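To make the query-based approach concrete, here is a minimal sketch of the birthday_Emps idea from the question. It uses plain Python dictionaries as in-memory stand-ins for the two column families, not the Cassandra driver. The function names, column names, and the CQL shown in the comments are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date

# Base table: one row per employee, keyed by employee ID.
# Roughly: CREATE TABLE emps (emp_id int PRIMARY KEY, name text, birthdate date);
emps = {}

# Query table: one partition per birthdate, one row per employee in it.
# Roughly: CREATE TABLE birthday_emps (birthdate date, emp_id int, name text,
#                                      PRIMARY KEY (birthdate, emp_id));
birthday_emps = defaultdict(dict)


def insert_employee(emp_id, name, birthdate):
    """Write the employee to the base table and to the query table."""
    emps[emp_id] = {"name": name, "birthdate": birthdate}
    # The name is duplicated here so the read below needs no second lookup.
    birthday_emps[birthdate][emp_id] = {"name": name}


def employees_born_on(birthdate):
    """Read a single partition: no scan of emps and no client-side join."""
    rows = birthday_emps.get(birthdate, {})
    return sorted(row["name"] for row in rows.values())


insert_employee(1, "Alice", date(1975, 4, 25))
insert_employee(2, "Bob", date(1975, 4, 25))
insert_employee(3, "Carol", date(1980, 1, 2))

print(employees_born_on(date(1975, 4, 25)))  # ['Alice', 'Bob']
```

The point of the sketch is the trade: every write touches two tables, and in exchange the birthday query becomes a lookup by key.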

  1. That is still a good idea, because the faster query times make it worth it. Yes, there's a little more housekeeping to do. I haven't had to execute hundreds of deletes from other column families, but occasionally there is some complicated cleanup to do. Still, you shouldn't be doing a whole lot of deleting in Cassandra anyway (it's an anti-pattern).

  2. No. Client-side JOINs are just as bad as distributed JOINs. The whole idea is to create a table that returns the data for each specific query (denormalized and/or replicated), thus negating the need to do a JOIN at all. The exception is OLAP queries for analysis, where you can use a tool like Apache Spark to execute an ad-hoc, distributed JOIN. But it's definitely not something you'd want to do on a production system.

  3. I can recommend a few articles:

  • Getting Started with Cassandra Time Series Data Modeling - Written by DataStax's Chief Evangelist Patrick McFadin, it covers one of the more common Cassandra use cases in a few different ways.
  • Escaping From Disco-Era Data Modeling - This one talks about some of the obstacles that beginners with Cassandra can face, as well as the general approach to take in overcoming them. Disclaimer: I am the author.
  • Cassandra Data Modeling Best Practices, Part 1 - You can't go wrong with Jay Patel's (eBay) classic article on Cassandra modeling practices. It's a little dated in that the examples are grounded in the pre-CQL world, but the techniques still resonate.
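
As a rough illustration of the cleanup that a delete involves in a denormalized model, the sketch below removes an employee from a base table and from one query table. The dictionaries are hypothetical in-memory stand-ins for Cassandra tables, and the sample rows and the delete_employee helper are made up for the example.

```python
# Hypothetical in-memory stand-ins for two Cassandra tables.
emps = {7: {"name": "Dana", "birthdate": "1975-04-25"}}
birthday_emps = {"1975-04-25": {7: {"name": "Dana"}, 8: {"name": "Eli"}}}


def delete_employee(emp_id):
    """Remove an employee from the base table and from the query table."""
    row = emps.pop(emp_id, None)
    if row is None:
        return False  # unknown employee, nothing to clean up
    # The query table is keyed by birthdate, so the old row is needed to
    # find the partition that holds this employee's denormalized copy.
    partition = birthday_emps.get(row["birthdate"], {})
    partition.pop(emp_id, None)
    if not partition:
        # Drop the partition entry once its last employee is gone.
        birthday_emps.pop(row["birthdate"], None)
    return True


delete_employee(7)
print(birthday_emps)  # {'1975-04-25': {8: {'name': 'Eli'}}}
```

Each additional query table adds one more removal to this function, which is where the extra housekeeping comes from. In a real cluster these would be separate CQL statements, possibly grouped in a logged batch.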