存储在 SQL 中的统计信息
Statistics that live in your SQL

原始链接: https://kolistat.com/blog/the-stats-duck-v0-6-0/

开源扩展 **the-stats-duck** (v0.6.0, "i-m-not-dead") 发布了重大更新,支持直接在 DuckDB 中通过 SQL 进行高级统计分析。 该扩展允许用户无需离开 SQL 环境即可执行复杂任务,包括回归分析、分布计算和自助法(bootstrapping)。v0.6.0 版本的核心功能包括: * **统计函数:** 增强了用于数据分析的 `meta()` 函数,提供了全面的线性建模(`lm`、`lm_summary`),以及经 R 语言交叉验证的各种概率分布和随机采样功能。 * **可视化:** 更新了 `ggsql` 语法,新增了小提琴图、二维分面绘图,以及用于直观数据探索的 `smooth` (LOESS) 和 `regression` 等统计修饰符。 * **性能:** 对 SAS/SPSS/Stata 文件读取器进行了大幅重构,通过优化流式处理,使大数据集的读取性能提升了 36 到 50 倍。 *the-stats-duck* 基于 MIT 协议开源,兼容任何 DuckDB 实例(包括浏览器端),使 DuckDB 摇身一变成为功能完备的统计引擎。目前它已支持 Bedevere 和 KoliLang,任何希望将稳健的统计分析直接集成到 SQL 工作流中的用户均可使用。

Hacker News 最新 | 过往 | 评论 | 提问 | 展示 | 招聘 | 提交 登录 驻留在 SQL 中的统计数据 (kolistat.com) 12 分,由 caerbannogwhite 于 2 小时前发布 | 隐藏 | 过往 | 收藏 | 讨论 | 帮助 指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系 搜索:
相关文章

原文

the-stats-duck — our open-source DuckDB extension — just shipped v0.6.0, and in keeping with the whole operation, the release is named i-m-not-dead. It very much isn't!

In case you haven't met it before: the-stats-duck is the statistics engine that Bedevere and KoliLang lean on. It allows DuckDB to do real statistics — distributions, tests, regression, even plots — without ever leaving SQL. MIT-licensed, and it runs anywhere DuckDB runs — including a browser!

Which is quite handy, because every demo below is a live Bedevere instance running the-stats-duck right here in your browser. The dataset is the famous Palmer Penguins — change the SQL and re-run it.

SELECT * FROM 'penguins';

The first thing anyone would usually do with a new dataset is squint at it. meta() squints for you the full profile — one row per column:

SELECT
  column_name, kind, n_missing
  , n_distinct, mean, median, stddev, top
FROM meta('penguins')
;

It overlaps with DuckDB's built-in SUMMARIZE, but meta() is a table function — so you can join it, filter it, and compose it in CTEs. Do you need to know "how many numeric columns, and how many missing values total?" That's just aggregation over meta().

Yes, ordinary least squares, in a SQL query, with the formula syntax you already know:

SELECT *
FROM lm_summary(
  'penguins'
  , formula := 'body_mass_g ~ flipper_length_mm + bill_length_mm'
);

lm() gives you the coefficient table (estimate, std_error, t_statistic, p_value per term); lm_summary() gives the model-level line (R², adjusted R², F and its p-value, residual df, sigma). The formula language handles additive predictors and intercept removal (- 1); interactions and inline transforms aren't in v0.6 yet, but we might add them soon!

Under the hood it's a Cholesky solve of X'X, complete-case filtered, and cross-verified against R on cars and mtcars to four decimals.

Don't want to assume normality? Resample:

WITH b AS (
  SELECT bootstrap(body_mass_g, 'mean', 2000, 42) AS samples
  FROM penguins
)
SELECT
  list_aggregate(samples, 'quantile_cont', 0.025) AS lo
  , list_aggregate(samples, 'quantile_cont', 0.975) AS hi
FROM b
;

bootstrap(value, statistic, n_iters [, seed]) resamples with replacement and hands back a LIST<DOUBLE> of the statistic from each resample (mean, median, stddev, …). Pair it with list_aggregate(list, 'quantile_cont', p) for a percentile interval. Pass a seed and it's reproducible — even across GROUP BY groups.

the-stats-duck also speaks a small — for now — plot grammar (ggsql) that compiles to Vega-Lite, which Bedevere renders. v0.6.0 adds violin marks, 2-D facets (FACET BY row, col), and a per-layer STAT smooth (LOESS) — so DRAW point DRAW line STAT smooth is your scatter-with-trend in one line.

VISUALIZE
  species AS x
  , body_mass_g AS y
FROM penguins
DRAW violin
TITLE 'Body Mass (g) by species'
;

The other half of v0.6.0's chart story is the per-layer STAT modifier — appended after a mark as DRAW <mark> STAT <name>. It transforms that layer only, so you can stack a raw mark under a statistical one. The canonical use is a scatter with a LOESS overlay:

VISUALIZE
  bill_depth_mm AS x
  , bill_length_mm AS y
  , species AS color
FROM penguins
DRAW point
DRAW line STAT smooth
SCALE x ZERO false
SCALE x LABEL 'Bill Depth (mm)'
SCALE y ZERO false
SCALE y LABEL 'Bill Length (mm)'
TITLE 'Penguins Bill Depth VS Bill Length (Smooth)'
;

Three modifiers ship. smooth injects a Vega-Lite LOESS transform on (x, y), grouped by color when it's mapped — one fitted curve per species above. summary rewrites the layer's data SQL to AVG(y) GROUP BY x [, color, facet, facet2] ORDER BY x, collapsing each cell to its mean. identity is the explicit no-op. smooth is rejected on marks that already emit their own transform — regression, density, violin, histogram — which bake the statistic in themselves.

Where STAT smooth traces the data's local shape, the regression mark draws a single straight line — a least-squares y ~ x fit, grouped by color. Same axes as above, but each species gets a linear trend instead of a LOESS curve:

VISUALIZE
  bill_depth_mm AS x
  , bill_length_mm AS y
  , species AS color
FROM penguins
DRAW point
DRAW regression
SCALE x ZERO false
SCALE x LABEL 'Bill Depth (mm)'
SCALE y ZERO false
SCALE y LABEL 'Bill Length (mm)'
TITLE 'Penguins Bill Depth VS Bill Length (Regression)'
;

The distribution zoo grew: negative binomial, hypergeometric, Weibull, log-normal, and Poisson, each with the R-style d / p / q triple — plus a complete set of random samplers: rnorm, rt, rchisq, rf, rgamma, rbeta, rexp, rweibull, rlnorm, rpois. Every one cross-checked against R to six decimals. Simulate, fit, and test without round-tripping to another language.

Here's the Poisson(λ = 3) probability mass function — dpois evaluated over 0…10 and drawn as a bar chart, all in SQL:

WITH pois AS (
  SELECT k, dpois(k, 3) AS pmf
  FROM range(0, 11) AS t(k)
)
VISUALIZE
  k AS x
  , pmf AS y
FROM pois
DRAW bar
;

Last, and this one's for the XPT crowd: reading SAS / SPSS / Stata files used to be accidentally O(N²): the reader re-parsed the file from byte zero for every 2,048-row chunk! 😱

v0.6.0 parses each file exactly once into a buffered, spillable column store and streams from it. Here are the numbers:

  • a 200k-row, 7 MB XPT: ~67 s → ~1.3 s (52x)
  • the real CDISC pilot qs.xpt (122k rows): ~39 s → ~1.1 s (36x)
  • 1.6M rows: ~70 min → ~15 s

the-stats-duck is open source: github.com/KoliStat/the-stats-duck (full v0.6.0 notes here). It powers Bedevere and KoliLang, but it's just a DuckDB extension: drop it into any DuckDB and your SQL grows a statistics department.

Not dead. Doing your statistics.

联系我们 contact @ memedata.com