scipy.stats.chatterjeexi

Original link: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chatterjeexi.html

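This code demonstrates Chatterjee's xi correlation, a measure of association between two variables that is especially useful when standard correlation measures are a poor fit. The example first generates perfectly dependent data (y = sin(x)) and shows that the xi statistic is close to 1.0 (about 0.90 for n = 100) with a very small p-value, indicating strong association. Adding noise to `y` lowers the statistic, as expected. The code also confirms that when `y` is continuous, passing `y_continuous=True` leaves the statistic unchanged. Finally, the script deals with ties in `x`. It shows that ties affect the statistic, and suggests either breaking ties at random or averaging the results over many random tie-breaks for a more robust estimate, although the latter can be computationally expensive. The code illustrates both approaches.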

Hacker News: Scipy.stats. Chatterjeexi (scipy.org), 9 points, submitted by kamaraju, 1 comment

simlevesque: Paper: https://arxiv.org/abs/1909.10140

Original text

Generate perfectly correlated data, and observe that the xi correlation is nearly 1.0.

>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> x = rng.uniform(0, 10, size=100)
>>> y = np.sin(x)
>>> res = stats.chatterjeexi(x, y)
>>> res.statistic
np.float64(0.9012901290129013)
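
To make the statistic less of a black box, here is a minimal sketch of the no-ties formula from Chatterjee's paper: sort the pairs by x, rank the y values, and compute 1 - 3 * sum(|r[i+1] - r[i]|) / (n^2 - 1). The helper name `xi_no_ties` is hypothetical and this is not SciPy's implementation; it assumes 1-D inputs with no ties in either variable.

```python
import numpy as np

def xi_no_ties(x, y):
    """Chatterjee's xi for 1-D data without ties (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # Reorder y so that the corresponding x values are increasing.
    y_by_x = y[np.argsort(x, kind="stable")]
    # Rank of each y value (1..n); only valid when y has no ties.
    r = np.argsort(np.argsort(y_by_x)) + 1
    # Sum of absolute rank differences between consecutive points.
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
x = rng.uniform(0, 10, size=100)
xi_sin = xi_no_ties(x, np.sin(x))             # y is a function of x
xi_ind = xi_no_ties(x, rng.normal(size=100))  # y is independent of x
```

For a strictly monotone relationship the consecutive rank differences are all 1, so the formula gives (n - 2) / (n + 1), which is why the statistic approaches 1 only as n grows.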

The probability of observing such a high value of the statistic under the null hypothesis of independence is very low.

>>> res.pvalue
np.float64(2.2206974648177804e-46)
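
A rough way to see where such a small p-value comes from: for continuous y under independence, sqrt(n) * xi is approximately normal with mean 0 and variance 2/5. The sketch below applies that approximation with the standard library only. The helper name `xi_asymptotic_pvalue` is hypothetical, and the upper-tail, continuous-y setup is an assumption here rather than a description of SciPy's internals.

```python
import math

def xi_asymptotic_pvalue(xi, n):
    """Upper-tail p-value from the normal approximation (continuous y assumed)."""
    # Under independence with continuous y, sqrt(n) * xi is roughly N(0, 2/5).
    z = math.sqrt(n) * xi / math.sqrt(2.0 / 5.0)
    # Standard normal survival function via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Statistic and sample size taken from the example above.
p = xi_asymptotic_pvalue(0.9012901290129013, 100)
```

With these inputs the approximation lands on the order of 1e-46, the same magnitude as the value reported above.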

As noise is introduced, the correlation coefficient decreases.

>>> noise = rng.normal(scale=[[0.1], [0.5], [1]], size=(3, 100))
>>> res = stats.chatterjeexi(x, y + noise, axis=-1)
>>> res.statistic
array([0.79507951, 0.41824182, 0.16651665])

Because the distribution of y is continuous, it is valid to pass y_continuous=True. The statistic is identical, and the p-value (not shown) is only slightly different.

>>> stats.chatterjeexi(x, y + noise, y_continuous=True, axis=-1).statistic
array([0.79507951, 0.41824182, 0.16651665])

Consider a case in which there are ties in x.

>>> x = rng.integers(10, size=1000)
>>> y = rng.integers(10, size=1000)

[1] recommends breaking the ties uniformly at random.

>>> d = rng.uniform(1e-5, size=x.size)
>>> res = stats.chatterjeexi(x + d, y)
>>> res.statistic
-0.029919991638798438

Since this gives a randomized estimate of the statistic, [1] also suggests considering the average over all possibilities of breaking ties. This is computationally infeasible when there are many ties, but a randomized estimate of this quantity can be obtained by considering many random possibilities of breaking ties.

>>> d = rng.uniform(1e-5, size=(9999, x.size))
>>> res = stats.chatterjeexi(x + d, y, axis=1)
>>> np.mean(res.statistic)
0.001186895213756626
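
The averaging idea can also be sketched without SciPy. The hypothetical helper `xi_with_y_ties` below uses the tie-aware form of the statistic from Chatterjee's paper, 1 - n * sum(|r[i+1] - r[i]|) / (2 * sum(l[i] * (n - l[i]))), where r[i] counts the y values at or below y[i] and l[i] counts those at or above it. It assumes x has already been made tie-free by jittering; the number of repetitions (200) and the seed are arbitrary choices for illustration.

```python
import numpy as np

def xi_with_y_ties(x, y):
    """Chatterjee's xi allowing ties in y; x is assumed to be tie-free."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    y_by_x = y[np.argsort(x)]   # y reordered by increasing x
    y_sorted = np.sort(y_by_x)
    # r[i]: number of observations with y <= y[i]
    r = np.searchsorted(y_sorted, y_by_x, side="right")
    # l[i]: number of observations with y >= y[i]
    l = n - np.searchsorted(y_sorted, y_by_x, side="left")
    num = n * np.abs(np.diff(r)).sum()
    den = 2 * np.sum(l * (n - l))
    return 1.0 - num / den

rng = np.random.default_rng(0)
x = rng.integers(10, size=1000)
y = rng.integers(10, size=1000)

# Jitter in [0, 1) keeps the order of distinct integer x values
# while breaking ties among equal ones at random.
estimates = [
    xi_with_y_ties(x + rng.uniform(size=x.size), y) for _ in range(200)
]
mean_xi = np.mean(estimates)
```

Because x and y are independent here, the averaged estimate should sit close to zero, and the spread of `estimates` shows how much a single random tie-break can vary.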