Comparing Python packages for A/B test analysis (with code examples)

Original link: https://e10v.me/python-packages-for-ab-test-analysis/



Tags: a/b testing, statistics, tea-tasting, python

Disclosure: I am also the author of tea-tasting.

This article compares four Python packages that are relevant to A/B test analysis: tea-tasting, Pingouin, statsmodels, and SciPy. It does not try to pick a universal winner. Instead, it clarifies what each package does well for common experimentation tasks and how much manual work is needed to produce production-style A/B test outputs.

It assumes familiarity with A/B testing basics, including randomization, p-values, and confidence intervals.

A/B test setting and analysis requirements #

A/B tests in a nutshell #

An A/B test compares two (or more) variants of a product change by randomly assigning experimental units to variants and measuring outcomes. In online experiments, the randomization unit is usually the user, and the standard assumption is that units are independent.

A typical workflow is:

  1. Design the experiment: choose the randomization unit (usually users), define the target population, and estimate sample size and duration with power analysis.
  2. Run the experiment: ship the treatment, randomize traffic, and collect data.
  3. Analyze and interpret results: compute control and treatment metric values, estimate effects with confidence intervals, and report p-values.

Good references for this mindset include books and papers by Ron Kohavi and Alex Deng, especially on trustworthy experimentation, delta-method metrics, and CUPED.
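Step 1 of the workflow (sample size via power analysis) can be sketched with statsmodels' power tools. This is a minimal illustration, not part of the article's examples; the lift, mean, and standard deviation below are made-up planning inputs:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

# Hypothetical planning inputs: detect a 2% relative lift on a metric with
# mean 10 and standard deviation 20. Cohen's d = absolute effect / std.
effect_size = (10 * 0.02) / 20

# Solve for the per-group sample size at alpha=0.05 and 80% power.
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.8,
    ratio=1.0,
    alternative="two-sided",
)
print(f"required users per variant: {int(np.ceil(n_per_group)):,}")
```

Small relative lifts on noisy metrics lead to large sample sizes, which is why duration estimates belong in the design step.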

Typical metric types and tests #

The table below summarizes common A/B test metric families and the tests usually applied.

| Metric type | Examples | Typical test |
| --- | --- | --- |
| Average | Average revenue per user, average orders per user | Welch's t-test (unequal-variance two-sample t-test) |
| Ratio of averages | Average revenue per order, average orders per session | Welch's t-test with variance from the delta method |
| Proportion | Proportion of users with at least one order | Asymptotic tests (Z-test, G-test, Pearson's chi-squared) or exact tests (Boschloo, Barnard, Fisher) |

Typical metric types and tests

Typical per-metric analysis output #

For each metric, analysts usually want the same core fields:

  • Metric value in control.
  • Metric value in treatment.
  • Effect size estimate and confidence interval (absolute and/or relative).
  • P-value.

This output format is what makes A/B test analysis convenient for repeated use across many experiments.

A/B testing specifics #

Some details matter a lot in real experimentation workflows.

  • Relative effect size is often easier to interpret than absolute effect size. A common mistake is to divide an absolute confidence interval by the control mean. That does not generally produce a valid confidence interval for a relative effect. Use the delta method or Fieller's theorem instead (both are standard approaches for ratio-style uncertainty).
  • Variance reduction (especially CUPED, which uses pre-experiment covariates) is widely used to increase power. CUPED for ratio metrics is more subtle because it combines variance reduction with ratio estimators and delta-method variance.
  • Multiple hypothesis testing correction becomes important when you track many metrics or compare multiple variants. In practice, teams usually need a clear FWER/FDR policy and adjusted p-values (or adjusted significance thresholds) in experiment reports.
  • Efficiency matters. Teams often analyze many metrics across many experiments. For many tests, only aggregated statistics are required (count, mean, variance, covariance). In those cases, it is usually more efficient to compute aggregates in the data backend and send only summary statistics to Python. A convenience layer for aggregate fetching can reduce both code and latency.
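The delta-method route to a relative effect confidence interval (the first point above) can be sketched with NumPy and SciPy alone. The relative_effect_ci helper and the synthetic data are illustrative, not taken from any of the packages:

```python
import numpy as np
from scipy import stats

def relative_effect_ci(treatment, control, alpha=0.05):
    """Delta-method CI for the relative effect mean(treatment) / mean(control) - 1.

    Assumes independent samples and a normal approximation.
    """
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    mt, mc = t.mean(), c.mean()
    # Delta-method variance of the ratio of two independent sample means.
    var_ratio = (t.var(ddof=1) / t.size) / mc**2 + (mt**2 / mc**4) * (
        c.var(ddof=1) / c.size
    )
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(var_ratio)
    rel = mt / mc - 1
    return rel, (rel - half, rel + half)

# Hypothetical per-user revenue samples for two variants.
rng = np.random.default_rng(42)
control = rng.exponential(10.0, size=2_000)
treatment = rng.exponential(10.5, size=2_000)
effect, (lo, hi) = relative_effect_ci(treatment, control)
print(f"rel_effect_size={effect:.3f} rel_effect_size_ci=[{lo:.3f}, {hi:.3f}]")
```

Note that this interval is generally asymmetric around the naive "absolute CI divided by the control mean", which is exactly why the shortcut is invalid.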

Example data setup #

With that context, the examples below use the same synthetic experiment dataset generated with tea-tasting. They intentionally do not use variance reduction (no CUPED), so the package comparisons stay focused on baseline analysis.

import tea_tasting as tt

data = tt.make_users_data(return_type="pandas", rng=42, n_users=5_000)
data["has_order"] = (data["orders"] > 0).astype(int)

control = data[data["variant"] == 0]
treatment = data[data["variant"] == 1]

print(data.head(3).to_string(index=False))



The metrics used in the examples are:

  • orders_per_user: average orders per user.
  • users_with_orders: proportion of users with at least one order.
  • revenue_per_user: average revenue per user.
  • revenue_per_order: average revenue per order (ratio of averages).

All package examples below reuse data, control, and treatment from this setup.

Package-by-package comparison #

tea-tasting #

tea-tasting is a package specifically designed for A/B test analysis. It targets experimentation workflows directly, with metrics, relative effects, CUPED, power analysis, and concise experiment-style outputs.

Best for: teams that want an A/B-testing-first workflow with minimal glue code.

This is the most compact example among the four packages because tea-tasting provides A/B-specific metric classes and a high-level Experiment API.

experiment = tt.Experiment(
    orders_per_user=tt.Mean("orders"),
    users_with_orders=tt.Proportion("has_order", correction=False),
    revenue_per_user=tt.Mean("revenue"),
    revenue_per_order=tt.RatioOfMeans("revenue", "orders"),
)
result = experiment.analyze(data)

print(result)




A/B testing specifics:

  • Power analysis: built-in (Experiment.solve_power and metric parameters for effect size, relative effect size, and sample size).
  • Relative effect and confidence interval: first-class output (including relative CI in the printed result).
  • CUPED: built-in for averages and ratio metrics.
  • Multiple hypothesis testing correction: built-in (tt.adjust_fdr and tt.adjust_fwer) for experiment results, including FDR and FWER procedures.
  • Aggregated statistics workflow: built-in support via metrics that expose required aggregates and integration with data backends (e.g., through Ibis-supported engines).

Pingouin #

Pingouin is a user-friendly statistical package focused on convenient inferential statistics in pandas-centric workflows. It is strong for common tests and effect sizes, but it is not an A/B-specific framework.

Best for: quick pandas-based analyses of standard statistical tests.

Pingouin has a convenient t-test interface and a contingency-table chi-squared helper. It does not provide a built-in ratio-of-averages test with delta-method variance, so the revenue_per_order example is omitted.

import pingouin as pg

orders_test = pg.ttest(
    treatment["orders"],
    control["orders"],
    correction=True,
).iloc[0]
print(
    "orders_per_user: "
    f"control={control['orders'].mean():.3f} "
    f"treatment={treatment['orders'].mean():.3f} "
    f"effect_size="
    f"{treatment['orders'].mean() - control['orders'].mean():.3f} "
    f"effect_size_ci={orders_test['CI95%']} "
    f"pvalue={orders_test['p-val']:.4f}"
)


_, _, tests = pg.chi2_independence(
    data,
    x="variant",
    y="has_order",
    correction=False,
)
pearson = tests.loc[tests["test"] == "pearson"].iloc[0]
print(
    "users_with_orders: "
    f"control={control['has_order'].mean():.3f} "
    f"treatment={treatment['has_order'].mean():.3f} "
    f"effect_size="
    f"{treatment['has_order'].mean() - control['has_order'].mean():.3f} "
    f"pvalue={pearson['pval']:.4f}"
)

Notes:

  • revenue_per_user uses the same pg.ttest(...) pattern as orders_per_user.
  • revenue_per_order (ratio of averages) requires manual derivation if you want a statistically correct delta-method analysis.
  • The proportion example above shows a p-value, but not a built-in A/B-style effect CI.
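For reference, the manual delta-method derivation for a ratio-of-averages metric might look like the sketch below. The ratio_of_means helper and the synthetic per-user data are mine, not part of Pingouin:

```python
import numpy as np
from scipy import stats

def ratio_of_means(y, x):
    """Point estimate and delta-method variance of mean(y) / mean(x)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    my, mx = y.mean(), x.mean()
    vy, vx = y.var(ddof=1), x.var(ddof=1)
    cov = np.cov(y, x, ddof=1)[0, 1]
    ratio = my / mx
    # Delta-method variance of the ratio of two correlated sample means.
    var = (vy / mx**2 - 2 * my * cov / mx**3 + my**2 * vx / mx**4) / y.size
    return ratio, var

# Hypothetical per-user (orders, revenue) pairs for two variants.
rng = np.random.default_rng(0)
orders_c = rng.poisson(2.0, 3_000)
revenue_c = orders_c * rng.gamma(2.0, 5.0, 3_000)
orders_t = rng.poisson(2.0, 3_000)
revenue_t = orders_t * rng.gamma(2.0, 5.2, 3_000)

r_c, v_c = ratio_of_means(revenue_c, orders_c)
r_t, v_t = ratio_of_means(revenue_t, orders_t)
z = (r_t - r_c) / np.sqrt(v_c + v_t)
pvalue = 2 * stats.norm.sf(abs(z))
print(f"revenue_per_order: control={r_c:.3f} treatment={r_t:.3f} pvalue={pvalue:.4f}")
```

The covariance term matters: per-user revenue and orders are correlated, so treating the ratio's numerator and denominator as independent would misstate the variance.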

A/B testing specifics:

  • Power analysis: built-in for standard cases (power_ttest, power_ttest2n, power_chi2), but not an A/B-specific multi-metric workflow.
  • Relative effect and confidence interval: mostly manual for A/B-style relative lift CIs.
  • CUPED: no built-in CUPED abstraction. You would implement variance reduction manually (for example, with regression).
  • Multiple hypothesis testing correction: built-in p-value adjustment via pg.multicomp (for example, Bonferroni, Holm, and FDR methods), but integration into an A/B reporting workflow is manual.
  • Aggregated statistics workflow: mostly expects granular arrays/DataFrames; no built-in A/B aggregate interface.

statsmodels #

statsmodels is a broad statistical modeling library with strong hypothesis testing, power analysis, and confidence interval utilities. It is less opinionated than an experimentation-specific package, which is a strength if you want building blocks and explicit control.

Best for: analysts who want mature statistical building blocks and are comfortable assembling a workflow.

The example below uses Welch-style t-tests for two average metrics and a risk-ratio test/CI for the proportion metric. revenue_per_order is omitted because there is no built-in A/B-style ratio-of-averages delta-method helper.

from statsmodels.stats.proportion import (
    confint_proportions_2indep,
    test_proportions_2indep,
)
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW

def welch_summary(treatment_series, control_series):
    cm = CompareMeans(
        DescrStatsW(treatment_series),
        DescrStatsW(control_series),
    )
    _, pvalue, _ = cm.ttest_ind(usevar="unequal")
    ci_low, ci_high = cm.tconfint_diff(usevar="unequal")
    return (
        control_series.mean(),
        treatment_series.mean(),
        treatment_series.mean() - control_series.mean(),
        ci_low,
        ci_high,
        pvalue,
    )

for metric in ["orders", "revenue"]:
    ctrl, trt, effect, ci_low, ci_high, pvalue = welch_summary(
        treatment[metric],
        control[metric],
    )
    print(
        f"{metric}_per_user: "
        f"control={ctrl:.3f} treatment={trt:.3f} effect_size={effect:.3f} "
        f"effect_size_ci=[{ci_low:.3f}, {ci_high:.3f}] pvalue={pvalue:.4f}"
    )



count1 = int(treatment["has_order"].sum())
nobs1 = len(treatment)
count0 = int(control["has_order"].sum())
nobs0 = len(control)
prop_test = test_proportions_2indep(
    count1=count1,
    nobs1=nobs1,
    count2=count0,
    nobs2=nobs0,
    compare="ratio",
    method="log",
)
prop_ci = confint_proportions_2indep(
    count1=count1,
    nobs1=nobs1,
    count2=count0,
    nobs2=nobs0,
    compare="ratio",
    method="log",
)
print(
    "users_with_orders: "
    f"control={control['has_order'].mean():.3f} "
    f"treatment={treatment['has_order'].mean():.3f} "
    f"rel_effect_size={prop_test.ratio - 1:.3f} "
    f"rel_effect_size_ci=[{prop_ci[0] - 1:.3f}, {prop_ci[1] - 1:.3f}] "
    f"pvalue={prop_test.pvalue:.4f}"
)

A/B testing specifics:

  • Power analysis: strong built-in support (TTestIndPower, NormalIndPower, GofChisquarePower, and more).
  • Relative effect and confidence interval: partial. There are built-in options for some cases (for example, risk ratios for proportions), but no general A/B-style relative lift CI interface across metric types.
  • CUPED: no built-in one-call CUPED API, but it is practical to implement manually with regression tooling.
  • Multiple hypothesis testing correction: strong built-in support (statsmodels.stats.multitest, including multipletests and fdrcorrection).
  • Aggregated statistics workflow: partial. Proportion tests work directly from counts/sample sizes; other workflows often still require granular arrays or more manual setup.
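As one illustration of the manual CUPED route, the adjustment reduces to estimating a pooled theta and running a Welch t-test on adjusted values. The cuped_adjust helper and the synthetic data are hypothetical, not a statsmodels API:

```python
import numpy as np
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW

def cuped_adjust(y, covariate, theta, covariate_mean):
    """CUPED: y_cuped = y - theta * (x - E[x]), with a shared theta."""
    return y - theta * (covariate - covariate_mean)

# Hypothetical data: a pre-experiment covariate correlated with the outcome.
rng = np.random.default_rng(7)
n = 2_000
x_c = rng.gamma(2.0, 5.0, n)
x_t = rng.gamma(2.0, 5.0, n)
y_c = x_c + rng.normal(0.0, 5.0, n)
y_t = x_t + rng.normal(0.5, 5.0, n)

# Theta estimated on pooled data; covariate centered by its pooled mean.
x_all = np.concatenate([x_c, x_t])
y_all = np.concatenate([y_c, y_t])
theta = np.cov(y_all, x_all, ddof=1)[0, 1] / x_all.var(ddof=1)
y_c_adj = cuped_adjust(y_c, x_c, theta, x_all.mean())
y_t_adj = cuped_adjust(y_t, x_t, theta, x_all.mean())

# Welch's t-test on the variance-reduced values.
cm = CompareMeans(DescrStatsW(y_t_adj), DescrStatsW(y_c_adj))
_, pvalue, _ = cm.ttest_ind(usevar="unequal")
print(f"CUPED effect={y_t_adj.mean() - y_c_adj.mean():.3f} pvalue={pvalue:.4f}")
```

Extending this to ratio-of-averages metrics is where the manual route gets genuinely tricky, since CUPED then has to be combined with delta-method variances.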

SciPy #

SciPy is a foundational scientific computing and statistics package used directly or indirectly by many higher-level libraries, including the others in this comparison. It provides robust hypothesis tests and exact tests, but it does not provide a high-level A/B testing workflow.

Best for: low-level building blocks and custom A/B analysis code.

This snippet shows a Welch t-test for orders_per_user and a Pearson chi-squared test for users_with_orders. Exact tests such as Fisher, Barnard, and Boschloo are also available in SciPy.

import numpy as np
from scipy import stats

orders_test = stats.ttest_ind(
    treatment["orders"],
    control["orders"],
    equal_var=False,
)
orders_ci = orders_test.confidence_interval()
print(
    "orders_per_user: "
    f"control={control['orders'].mean():.3f} "
    f"treatment={treatment['orders'].mean():.3f} "
    f"effect_size="
    f"{treatment['orders'].mean() - control['orders'].mean():.3f} "
    f"effect_size_ci=[{orders_ci.low:.3f}, {orders_ci.high:.3f}] "
    f"pvalue={orders_test.pvalue:.4f}"
)


contingency = np.array(
    [
        [(control["has_order"] == 0).sum(), (control["has_order"] == 1).sum()],
        [
            (treatment["has_order"] == 0).sum(),
            (treatment["has_order"] == 1).sum(),
        ],
    ]
)
chi2_res = stats.contingency.chi2_contingency(contingency, correction=False)
print(
    "users_with_orders: "
    f"control={control['has_order'].mean():.3f} "
    f"treatment={treatment['has_order'].mean():.3f} "
    f"effect_size="
    f"{treatment['has_order'].mean() - control['has_order'].mean():.3f} "
    f"pvalue={chi2_res.pvalue:.4f}"
)

Notes:

  • revenue_per_user uses the same stats.ttest_ind(..., equal_var=False) pattern as orders_per_user.
  • revenue_per_order (ratio of averages) requires manual delta-method implementation.
  • Relative effect confidence intervals are also manual.

A/B testing specifics:

  • Power analysis: mostly manual in SciPy (or delegated to custom code / another package).
  • Relative effect and confidence interval: manual.
  • CUPED: manual.
  • Multiple hypothesis testing correction: partial. SciPy provides scipy.stats.false_discovery_control for BH/BY FDR adjustment, but broader multiple-comparison correction workflows are more limited than in statsmodels.
  • Aggregated statistics workflow: partial. SciPy supports some summary-statistics and contingency-table tests (for example, ttest_ind_from_stats and chi-squared/exact tests on contingency tables), but not an A/B-specific aggregate workflow.
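For example, SciPy's summary-statistics interface runs a Welch t-test directly from backend aggregates without granular data; the counts and moments below are made-up illustrative values:

```python
from scipy import stats

# Hypothetical aggregates computed in the data backend: mean, std, n per variant.
control_mean, control_std, control_n = 2.10, 1.85, 125_000
treatment_mean, treatment_std, treatment_n = 2.16, 1.88, 125_000

res = stats.ttest_ind_from_stats(
    mean1=treatment_mean, std1=treatment_std, nobs1=treatment_n,
    mean2=control_mean, std2=control_std, nobs2=control_n,
    equal_var=False,  # Welch's t-test
)
print(f"effect_size={treatment_mean - control_mean:.3f} pvalue={res.pvalue:.4g}")
```

With aggregates of this shape, only a handful of numbers per metric crosses the wire, regardless of how many users the experiment covers.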

Feature comparison for A/B testing #

SciPy underpins much of the Python statistics ecosystem. In principle, all the capabilities discussed here can be implemented with NumPy + SciPy plus custom code. The practical question is convenience and code verbosity. For that reason, the table below uses three labels:

  • built-in: directly supported in a way that fits common A/B analysis tasks.
  • partial: some built-in support exists, but not as a complete or ergonomic A/B workflow.
  • manual: possible, but requires custom implementation/glue code.

| Feature | tea-tasting | Pingouin | statsmodels | SciPy |
| --- | --- | --- | --- | --- |
| Power analysis to estimate required number of observations | built-in | built-in | built-in | manual |
| Welch's t-test or Student's t-test for analysis of averages | built-in | built-in | built-in | built-in |
| Welch's t-test with delta method for analysis of ratios of averages | built-in | manual | manual | manual |
| Two-sample proportion Z-test, G-test, or Pearson's chi-squared test | built-in | built-in | built-in | built-in |
| Relative effect size confidence intervals | built-in | manual | partial | manual |
| Variance reduction with CUPED for analysis of averages and ratios of averages | built-in | manual | manual | manual |
| Multiple hypothesis testing correction (FWER/FDR p-value adjustment) | built-in | built-in | built-in | partial |
| Working with aggregated statistics instead of granular data | built-in | manual | partial | partial |

Feature comparison for A/B testing

Conclusions #

The four packages sit at different levels of abstraction.

  • tea-tasting is the most A/B-specific option in this group. It is designed around metrics, experiments, relative effects, CUPED, and aggregate-based workflows.
  • Pingouin is convenient for standard statistical tests and quick analysis in pandas, but A/B-specific workflows (especially ratio metrics, relative CIs, and CUPED) are mostly manual.
  • statsmodels provides strong statistical building blocks and power analysis. It is a good choice when you want explicit control and are willing to assemble an experimentation workflow yourself.
  • SciPy is the essential foundation. It can support almost everything with enough custom code, but it is the most verbose option for repeated A/B test reporting.

If you run many experiments with multiple metrics and need consistent outputs, the main differentiator is not just statistical correctness. It is how much A/B-specific workflow a package gives you out of the box.

Inclusion criteria for the comparison #

For transparency, here are the minimum criteria I used to decide which packages to include.

  • Maintained: a recent release (for example, within a year) and recent commits (for example, within two months) reduce the risk of stale APIs and unresolved compatibility issues.
  • Well documented: a package should have both a user guide and an API reference, because A/B testing code is often reused by analysts with different levels of statistical depth.
  • Used by a community: a practical heuristic is at least 100 GitHub stars. This is not a quality guarantee, but it often means more examples and more edge cases have already been surfaced.

Scope note: This comparison focuses on frequentist A/B testing workflows. Bayesian-first experimentation frameworks are out of scope.

Excluded notable packages (and why):

  • spotify_confidence: excluded because it lacks documentation and has not added significant new features in the last couple of years.
  • ambrosia: excluded because it has not added new features in the last couple of years, aside from dependency version updates.

Note: The maintenance and documentation notes in this section are assessed as of March 1, 2026.

Resources #

The comparison above is based on the public documentation and APIs of the packages, and on their current stable versions on PyPI, as of March 1, 2026.

© Evgeny Ivanov 2026
