Google 在 SQL 中的新管道语法

Google 在 SQL 中的新管道语法
Google's new pipe syntax in SQL

原始链接: https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/

本文讨论了一种新开发的 SQL 语法，称为“管道查询语法”，旨在解决 SQL 的主要可用性问题之一：查询中子句的顺序。传统上，SQL 查询以“FROM”开头，这常常会导致混乱。相反，管道查询语法使用管道 (|) 分隔查询的组成部分。提供了一个演示传统语法和管道语法之间差异的示例： ````` 传统SQL：选择组件 ID，COUNT(*) FROM 票务系统表 WHERE allocateee_user.email = '用户名@email.com' AND 状态 IN（“新”、“已分配”、“已接受”）按组件 ID 分组按组件 ID DESC 排序；管道查询语法： FROM 票务系统表 |>> WHERE allocateee_user.email = '用户名@email.com' AND 状态 IN（“新”、“已分配”、“已接受”） |>> 合计计数(*) 按组件 ID 分组按组件 ID DESC 排序； ````` 作者还使用 Google 的 AI Assistant 将论文从原始 PDF 格式转换为 HTML，虽然并不完美，但表现相对较好。尽管远非完美无缺，但创建这样一个易于访问的在线版本的论文展示了使用这些技术可以带来的潜在好处。然而，通过投入更多的时间和资源可以取得进一步的改进。

Group By All 允许根据 SELECT 语句中包含的列自动指定要分组的所有列，从而简化 SQL 查询。 BigQuery 支持此功能。使用 Group By All 可以节省在 GROUP BY 子句中显式列出每一列的精力，并减少手动编写 GROUP BY 子句时因不匹配或忘记包含某些列而导致的错误。作者发现此功能非常有用，因为它与过滤和转换数据的心理过程相似，而且它利用现有的 SQL 知识而无需学习新语法。此外，作者很欣赏与现有代码库集成的便利性，因为可以逐渐添加包含此功能的新查询，而无需立即检修整个系统。虽然作者欣赏 PRQL 的简单性，但他更喜欢建议的管道 SQL 语法，因为它与现有系统兼容并最大限度地减少了迁移工作。然而，作者认识到潜在的批评，认为与 PRQL 提供的全面改进相比，该功能只是部分解决方案。尽管如此，作者认为，与彻底的改革相比，解决核心问题的渐进增强可以带来更大的采用率。最后，作者解决了有关查询可读性和自动完成的问题。他建议根据执行顺序组织查询，特别是从 SELECT 语句开始，后跟 FROM 子句，以便于理解预期输出并用于故障排除。此外，他认为查询优化和编译发生在两个子句编写之后，使得语句的初始排列与性能无关。最后，他建议用户可以根据个人偏好和使用上下文选择以不同的方式构建查询。

原文

SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL (via) A new paper from Google Research describing custom syntax for analytical SQL queries that has been rolling out inside Google since February, reaching 1,600 "seven-day-active users" by August 2024.

A key idea is here is to fix one of the biggest usability problems with standard SQL: the order of the clauses in a query. Starting with SELECT instead of FROM has always been confusing, see SQL queries don't start with SELECT by Julia Evans.

Here's an example of the new alternative syntax, taken from the Pipe query syntax documentation that was added to Google's open source ZetaSQL project last week.

For this SQL query:

SELECT component_id, COUNT(*)
FROM ticketing_system_table
WHERE
  assignee_user.email = '[email protected]'
  AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
GROUP BY component_id
ORDER BY component_id DESC;

The Pipe query alternative would look like this:

FROM ticketing_system_table
|> WHERE
    assignee_user.email = '[email protected]'
    AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
|> AGGREGATE COUNT(*)
   GROUP AND ORDER BY component_id DESC;

The Google Research paper is released as a two-column PDF. I snarked about this on Hacker News:

Google: you are a web company. Please learn to publish your research papers as web pages.

This remains a long-standing pet peeve of mine. PDFs like this are horrible to read on mobile phones, hard to copy-and-paste from, have poor accessibility (see this Mastodon conversation) and are generally just bad citizens of the web.

Having complained about this I felt compelled to see if I could address it myself. Google's own Gemini Pro 1.5 model can process PDFs, so I uploaded the PDF to Google AI Studio and prompted the gemini-1.5-pro-exp-0801 model like this:

Convert this document to neatly styled semantic HTML

This worked surprisingly well. It output HTML for about half the document and then stopped, presumably hitting the output length limit, but a follow-up prompt of "and the rest" caused it to continue from where it stopped and run until the end.

Here's the result (with a banner I added at the top explaining that it's a conversion): Pipe-Syntax-In-SQL.html

I haven't compared the two completely, so I can't guarantee there are no omissions or mistakes.

The figures from the PDF aren't present - Gemini Pro output tags like <img src="figure1.png" alt="Figure 1: SQL syntactic clause order doesn't match semantic evaluation order. (From [25].)"> but did nothing to help me create those images.

Amusingly the document ends with <p>(A long list of references, which I won't reproduce here to save space.)</p> rather than actually including the references from the paper!

So this isn't a perfect solution, but considering it took just the first prompt I could think of it's a very promising start. I expect someone willing to spend more than the couple of minutes I invested in this could produce a very useful HTML alternative version of the paper with the assistance of Gemini Pro.

One last amusing note: I posted a link to this to Hacker News a few hours ago. Just now when I searched Google for the exact title of the paper my HTML version was already the third result!

I've now added a <meta name="robots" content="noindex, follow"> tag to the top of the HTML to keep this unverified AI slop out of their search index. This is a good reminder of how much better HTML is than PDF for sharing information on the web!

Google 在 SQL 中的新管道语法 Google's new pipe syntax in SQL

Google 在 SQL 中的新管道语法
Google's new pipe syntax in SQL