# You gotta push if you wanna pull

原始链接: https://www.morling.dev/blog/you-gotta-push-if-you-wanna-pull/

## Data replication and consistency: summary

Data duplication need not mean chaos if a single canonical instance serves as the system of record, the only source that is mutated by updates. All other views of the data are derived from this source, and consumers accept **eventual consistency** for them.

While instantly synchronizing all views is impractical, techniques such as logical timestamps can be used to manage stronger consistency requirements. The challenge is updating these views efficiently as the source data changes. Rebuilding a view from scratch is wasteful; ideally, only the parts that *changed* should be updated.

**Push queries** offer a solution. Triggered by modifications to the source data, they emit only the necessary updates and spread the processing cost over time, which is efficient for incremental changes. For massive data changes (bulk deletes, backfills of historical data), however, a full **pull query** recomputation can be more effective, avoiding the overhead of millions of small updates.

Ultimately, the choice between push and pull queries depends on the scale and nature of the data changes.


## Original text

Now, the thought of duplication might trigger some reservations: isn’t chaos going to ensue if there are multiple copies of the same data set? Which version is the right one? Aren’t you at the risk of producing inconsistent query results? I think there’s not much to fear if you keep one fundamental principle in mind: there should be exactly one canonical instance of the data set. As the system of record it is the only one that gets mutated by business transactions. All other views of the data set are derived from this one, i.e. it is the source of truth.

In practice, it is not feasible to update all derived views synchronously, in particular if they are located in another system. This means consumers need to account for eventual consistency of view data. For many use cases, such as analytics, that is perfectly acceptable. Other situations might have stronger consistency requirements, making it necessary to prevent stale data from being retrieved from a view. Different techniques exist for doing so, such as tracking logical timestamps or log sequence numbers (LSNs).
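One way to picture the LSN technique: a writer notes the log position of its change, and a reader refuses to serve from a view that has not yet caught up to that position. A minimal sketch, with illustrative names (this is not the API of any particular system):

```python
# Sketch: preventing stale reads from a derived view by tracking
# log sequence numbers (LSNs). All names here are illustrative.

class DerivedView:
    def __init__(self):
        self.applied_lsn = 0  # LSN of the last source change applied here
        self.data = {}

    def apply_change(self, lsn, key, value):
        """Apply one replicated change and record how far we've caught up."""
        self.data[key] = value
        self.applied_lsn = lsn

    def read(self, key, min_lsn):
        """Serve a read only if the view reflects at least min_lsn;
        otherwise the caller should wait or retry."""
        if self.applied_lsn < min_lsn:
            raise RuntimeError("view is stale; retry later")
        return self.data.get(key)

view = DerivedView()
view.apply_change(lsn=42, key="order-1", value=100)
view.read("order-1", min_lsn=42)  # ok: the view has caught up to LSN 42
```

A client that just wrote at LSN 43 would pass `min_lsn=43` and be told to retry, which is exactly the "prevent stale data from being retrieved" guarantee described above.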

This raises the question of how to keep all these different flavors of materialized views in sync, as the original data set changes. New records will be added, existing ones be updated or removed, and all the derived views need to be updated in order to reflect these changes. You could periodically recreate any derived views from scratch, but not only might this be a very costly operation, you’d also have to deal with outdated or incomplete query results very quickly again.

Thinking about it, recomputing materialized views from scratch can be pretty wasteful. Typically, only small parts of a dataset change, hence also only small parts of any derived views should require updating. Intuitively, this makes a lot of sense. For instance, assume you’d want to keep track of the revenue per product category across the purchase orders in your system. When a new order arrives, would you recalculate the totals for all the categories, by processing all the orders? Of course not. Instead, you’d keep the totals of all the unrelated categories as-is. Only the total of the incoming order’s category needs updating, and you’d compute that one by simply adding the new order’s value to the previous total.
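The revenue-per-category example can be sketched in a few lines. This is a simplified in-memory illustration, not the implementation of any particular engine:

```python
# Sketch: incrementally maintaining revenue-per-category totals.
# Each incoming order touches exactly one aggregate; all other
# categories keep their previous totals untouched.

from collections import defaultdict

revenue_by_category = defaultdict(float)

def on_new_order(category, amount):
    """Handle one change event: update a single total in O(1),
    instead of re-scanning all orders."""
    revenue_by_category[category] += amount
    # A push query would emit just this one updated result downstream:
    return category, revenue_by_category[category]

on_new_order("books", 25.0)
on_new_order("games", 60.0)
on_new_order("books", 15.0)  # only the "books" total is recomputed
```

Each call returns exactly one updated `(category, total)` pair, which is the "one row in, one update out" behavior the next paragraph describes for push queries.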

This is exactly how push queries work. Triggered by changes to rows in the source tables they operate on, they’ll emit new (or updated) results reflecting exactly these changes. A new row in the purchase orders table in the previous example will yield exactly one update to the sum of that order’s category. That way, push queries solve the concern of pull queries potentially being too costly and taking too long to run. As they operate on small incremental data changes, the cost is distributed over time, and each output update can be calculated very quickly.

A core assumption of push queries is that the delta they operate on is comparatively small. If there is a massive data change—for instance when doing bulk deletes, or when backfilling historical data—instead of processing all these changes incrementally, triggering millions of state updates, each with its own overhead (lookups, partial aggregations, downstream propagation), it may be advantageous to fall back to a pull query processing the complete data set.
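The trade-off above can be expressed as a simple size check. A hedged sketch, with an arbitrary illustrative threshold (real systems would tune this, or decide on other signals entirely):

```python
# Sketch: choosing between push-style incremental updates and a
# pull-style full recomputation, based on the size of the change batch.

FULL_RECOMPUTE_THRESHOLD = 10_000  # illustrative tuning knob, not a recommendation

def refresh_totals(all_orders, changed_orders, totals):
    """Fold a batch of (category, amount) orders into per-category totals.

    Small batches are applied incrementally (push-style). Very large
    batches -- bulk deletes, backfills -- fall back to one full scan
    (pull-style), avoiding per-row update overhead."""
    if len(changed_orders) >= FULL_RECOMPUTE_THRESHOLD:
        totals = {}           # discard old state ...
        source = all_orders   # ... and recompute from the complete data set
    else:
        source = changed_orders  # process only the delta
    for category, amount in source:
        totals[category] = totals.get(category, 0) + amount
    return totals
```

Here `all_orders` is assumed to already include the changed rows, and `totals` to reflect the pre-change state; the point is only that one scan over the full set can beat millions of individual state updates.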
