I'm curious to hear a bit more of your opinion. For example, I'm surprised that syscall latency is something near the top of your list. I think the usual wisdom in the DB community is that the cost models are mostly fine, but the cardinality estimation is really bad.

On deferred / alternative planning: do you think adaptive query execution is a reasonable way to achieve this? It certainly allows information from early in the query's execution to impact later plans. My worry with these approaches is that if you get the first couple of joins wrong (which is not uncommon), then unless you have something like Yannakakis/SIPs, you still can't recover.

I am obviously biased on the whole "ML for query optimization" thing. One thing I would note is that every "ML for planning" approach I've seen does, under the hood, use ML for cost discovery/estimation. These approaches are just trying to balance the data they collect (exploration) with the quality of the plans they produce (exploitation). Interestingly, if you use ML in a way that is completely removed from planning, you actually get worse query plans despite more accurate estimates: https://people.csail.mit.edu/tatbul/publications/flowloss_vl... (again, I've got a horse in this race, so my opinion should come with a side of salt :D)
The author mentions PostgreSQL's JIT compiler. To this day, I've only seen it degrade query performance. Disabling it is on my install checklist.
|
The JIT compiler is great for analytical queries. You can also configure the JIT activation thresholds in PostgreSQL if you want to raise the cost bar above which the JIT kicks in.
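For reference, those thresholds are ordinary settings you can change per session or in `postgresql.conf`; the values below are illustrative, not recommendations:

```sql
-- Estimated plan cost above which JIT compilation is considered at all
SET jit_above_cost = 500000;         -- default is 100000
-- Higher bars for the more expensive JIT steps (defaults are 500000)
SET jit_inline_above_cost = 1000000;
SET jit_optimize_above_cost = 1000000;
-- Or turn JIT off entirely for the session
SET jit = off;
```

With a higher `jit_above_cost`, cheap OLTP queries skip JIT entirely while big analytical scans still benefit from it.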
Postgres is a fairly standard System R implementation. You convert your SQL into a join tree with some operations at the end, and then try various access paths and their combinations.
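The bottom-up, System R-style search can be sketched in a few lines. This is a toy for illustration only: the cost function is a stand-in, and the real planner also tracks access paths, interesting orders, and prunes aggressively.

```python
from itertools import combinations

def best_join_cost(relations, join_cost):
    """Selinger-style dynamic programming sketch: for every subset of
    relations, keep the cheapest way to produce it by combining two
    smaller, already-planned subsets. `join_cost(left, right)` is a
    hypothetical stand-in for the cost model + cardinality estimates."""
    dp = {frozenset([r]): 0.0 for r in relations}  # base table scans
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            best = float("inf")
            for k in range(1, size):
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    best = min(best, dp[left] + dp[right] + join_cost(left, right))
            dp[subset] = best
    return dp[frozenset(relations)]
```

The exponential blow-up over subsets is why real optimizers cap exhaustive search (e.g. Postgres falls back to GEQO beyond `geqo_threshold` tables).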
PostgreSQL Query Optimization: The Ultimate Guide to Building Efficient Queries, by Henrietta Dombrovskaya, Boris Novikov, and Anna Bailliekova, published by Apress in 2021.
|
* Every optimization problem needs data on costs, and the more and better that data, the better the result. Postgres has made some improvements here, especially with cross-column statistics, but there are still massive improvements left on the table. The most glaring omission is data on syscall latencies. Reading a page from disk has dramatically different latencies from system to system, and Postgres still relies on configuration values for these costs when it could very easily be measuring them. Another omission is foreign key statistics. Joins along foreign keys should never have a bad plan, but they still occasionally do.
* Deferred or alternative-scenario planning should be adopted, especially for large and expensive queries. As it stands today, your plan is finalized before it is executed, even though earlier stages of execution could provide information (like row counts or cardinality estimates) that could dramatically improve later-stage plans.
* Machine learning could definitely be an area of improvement, but I've been unimpressed with the efforts I've seen. Don't use machine learning for planning; use machine learning for cost discovery and estimation. Build better cost models, and then let the optimization engine work with that data.
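The "measure it instead of configuring it" point in the first bullet could look something like this sketch. Everything here is hypothetical and simplified; in particular, the OS page cache will hide true disk latency unless you bypass it (e.g. with `O_DIRECT`), so treat this as a shape, not a benchmark.

```python
import os
import random
import time

def measure_read_latency(path, page_size=8192, samples=200):
    """Estimate average random-read latency for fixed-size pages of a file.

    A hypothetical sketch of the kind of self-measurement a planner could
    do instead of relying on a configured random_page_cost."""
    n_pages = os.path.getsize(path) // page_size
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(samples):
            offset = random.randrange(n_pages) * page_size  # random page
            os.pread(fd, page_size, offset)                 # positioned read
        return (time.perf_counter() - start) / samples      # seconds/page
    finally:
        os.close(fd)
```

Comparing this against the same measurement done sequentially would give a machine-specific random-vs-sequential ratio, which is exactly what `random_page_cost` / `seq_page_cost` try to approximate by hand today.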