Now, the thought of duplication might trigger some reservations: isn’t chaos going to ensue if there are multiple copies of the same data set? Which version is the right one? Aren’t you at risk of producing inconsistent query results? I think there’s not much to fear if you keep one fundamental principle in mind: there should be exactly one canonical instance of the data set. As the system of record, it is the only one that gets mutated by business transactions. All other views of the data set are derived from this one, i.e. it is the source of truth.
In practice, it is not feasible to update all derived views synchronously, in particular if they are located in another system. This means consumers need to account for eventual consistency of view data. For many use cases, such as analytics, that is perfectly acceptable. Other situations might have stronger consistency requirements, making it necessary to prevent stale data from being retrieved from a view. Different techniques exist for doing so, such as tracking logical timestamps or log sequence numbers (LSNs).
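To make this tangible, here is a minimal sketch in Python of one such technique; the names (`View`, `applied_lsn`, `apply_change`, `read`) are made up for illustration and not taken from any particular system. The idea is that a writer notes the LSN of its transaction on the primary, and a read against the view is served only once the view has applied at least that LSN:

```python
import time


class View:
    """A derived view that records the LSN of the last change it has applied."""

    def __init__(self):
        self.applied_lsn = 0
        self.data = {}

    def apply_change(self, lsn, key, value):
        """Apply one replicated change and remember how far the view has caught up."""
        self.data[key] = value
        self.applied_lsn = lsn

    def read(self, key, min_lsn, timeout=5.0, poll_interval=0.05):
        """Serve a read only once the view has applied at least min_lsn,
        i.e. the LSN the caller observed when writing on the primary."""
        deadline = time.monotonic() + timeout
        while self.applied_lsn < min_lsn:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"view has not caught up to LSN {min_lsn}")
            time.sleep(poll_interval)
        return self.data.get(key)
```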
This raises the question of how to keep all these different flavors of materialized views in sync as the original data set changes. New records will be added, existing ones updated or removed, and all the derived views need to be updated in order to reflect these changes. You could periodically recreate any derived views from scratch, but not only would this be a very costly operation, you’d also quickly be back to dealing with outdated or incomplete query results.
Thinking about it, recomputing materialized views from scratch can be pretty wasteful. Typically, only small parts of a data set change, hence also only small parts of any derived views should need to be updated. Intuitively, this makes a lot of sense. For instance, assume you’d want to keep track of the revenue per product category across the purchase orders in your system. When a new order arrives, would you recalculate the totals for all the categories, by processing all the orders? Of course not. Instead, you’d keep the totals of all the unrelated categories as-is. Only the total of the incoming order’s category needs updating, and you’d compute that one by simply adding the new order’s value to the previous total.
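As a rough illustration of this arithmetic, a minimal sketch in Python might look as follows; `totals_by_category` and `on_new_order` are hypothetical names, not part of any actual system:

```python
from collections import defaultdict

# Running revenue totals per product category, maintained incrementally.
totals_by_category = defaultdict(float)


def on_new_order(category, value):
    """Apply a single incoming purchase order: only the affected
    category's total is touched, all other totals stay as-is."""
    totals_by_category[category] += value


# Example: two orders for "books", one for "garden".
on_new_order("books", 25.0)
on_new_order("garden", 120.0)
on_new_order("books", 10.0)
assert totals_by_category["books"] == 35.0
```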
This is exactly how push queries work. Triggered by changes to rows in the source tables they operate on, they’ll emit new (or updated) results reflecting exactly these changes. A new row in the purchase orders table in the previous example will yield exactly one update to the sum of that order’s category. That way, push queries solve the concern of pull queries potentially being too costly and taking too long to run. As they operate on small incremental data changes, the cost is distributed over time, and each output update can be calculated very quickly.
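The following sketch illustrates this push-style processing, assuming the change events carry the old and new row images (`before`/`after`), as a change data capture feed typically would; again, all names are made up for illustration:

```python
from collections import defaultdict

totals_by_category = defaultdict(float)


def on_change_event(event):
    """Process one change event from the purchase orders table and emit
    the updated total for each affected category."""
    updated = set()
    old, new = event.get("before"), event.get("after")
    if old is not None:  # update or delete: retract the old contribution
        totals_by_category[old["category"]] -= old["value"]
        updated.add(old["category"])
    if new is not None:  # insert or update: add the new contribution
        totals_by_category[new["category"]] += new["value"]
        updated.add(new["category"])
    for category in updated:
        print(f"category={category} total={totals_by_category[category]}")


# An insert, followed by an update that moves the order to another category.
on_change_event({"before": None, "after": {"category": "books", "value": 25.0}})
on_change_event({"before": {"category": "books", "value": 25.0},
                 "after": {"category": "garden", "value": 25.0}})
```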
A core assumption of push queries is that the delta they operate on is comparatively small. If there is a massive data change—for instance a bulk delete, or a backfill of historical data—processing all these changes incrementally would trigger millions of state updates, each with its own overhead (lookups, partial aggregations, downstream propagation). In such a case it may be advantageous to fall back to a pull query that processes the complete data set.
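One purely illustrative heuristic for that fallback is to compare the size of an incoming change batch against a threshold and recompute the view from scratch when it is exceeded; the threshold value and names below are assumptions, not recommendations:

```python
from collections import defaultdict

totals_by_category = defaultdict(float)
FULL_RECOMPUTE_THRESHOLD = 100_000  # arbitrary cut-off, tune for your workload


def apply_changes(changes, all_orders):
    """Apply a batch of change events incrementally, unless the batch is so
    large that a single recomputation over the complete data set is cheaper."""
    if len(changes) > FULL_RECOMPUTE_THRESHOLD:
        # Pull-query style: scan the complete data set once.
        totals_by_category.clear()
        for order in all_orders:
            totals_by_category[order["category"]] += order["value"]
    else:
        # Push-query style: process each change incrementally.
        for change in changes:
            totals_by_category[change["category"]] += change["value"]
```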