I don't think Lindley's paradox supports p-circling

Original link: https://vilgot-huhn.github.io/mywebsite/posts/20251206_p_circle_lindley/

## The Puzzle of P-Value Circling

This post examines "p-value circling" — casting doubt on results whose p-values fall just under the conventional significance threshold of 0.05. That threshold was set more or less arbitrarily about 100 years ago and, despite long-running criticism, is still in use. The author digs into the logic underlying p-values, rooted in the Neyman-Pearson framework, which concerns error rates under repeated testing of a null hypothesis. A key point: under a true null hypothesis, p-values should be uniformly distributed — every value equally likely. There is, however, the worry of "p-hacking" (manipulating data to reach significance). Simulations show that practices such as stopping data collection once a desired p-value is reached distort that uniform distribution, and this distortion *might* justify skepticism toward p-values near 0.05. A further complication is "Lindley's paradox": when a true effect exists and a study is highly powered, very low p-values become *more* likely, so in a high-powered study p = 0.048 can actually provide weaker evidence for an effect than p = 0.001 — it can even favor the null. The author ultimately concludes that a p-value on its own is not adequate evidence of misconduct, advocates interpreting p-values in light of a study's power and potential biases, and suggests Bayes factors as a more direct measure of evidential strength. The post is a personal exploration of a complicated topic that acknowledges possible misunderstandings and invites feedback.

## Hacker News Discussion Summary: Lindley's Paradox and Interpreting P-Values

The Hacker News discussion revolves around a blog post questioning the use of Lindley's paradox to justify "p-value circling" (calling attention to p-values that are significant but only barely so). Many commenters pushed back on the author's reading of Lindley's paradox, clarifying that it describes a situation where frequentist and Bayesian analyses can diverge sharply even under reasonable priors, not merely a quirk of power calculations. The core debate centered on *why* people pay attention to p-values. One commenter argued that it's because researchers care about real-world truth rather than the technical definition of statistical significance. Others stressed the importance of thinking carefully about type I and type II errors *before* collecting data, while acknowledging the practical difficulties created by research incentives. The discussion also touched on the validity of priors in Bayesian analysis, with some defending the use of seemingly extreme priors in specific situations. Overall, the conversation highlighted the complexity of statistical interpretation and the need to distinguish Bayesian updating from prior assumptions. A Veritasium video explaining the paradox with skewed priors was also shared.

## Original Post

As usual I’d like to preface all this by saying that I write these blogposts as attempts to make sense of a subject for my own sake. I am not an expert here and it is likely I am confused about some details. On the other hand, I think “confused” discourse can also be productive to read and participate in. Being confused is just the first step towards being unconfused, to paraphrase Jake The Dog.

p-value circling

100 years ago this year Fisher arbitrarily suggested using p < 0.05 as a cut-off for “significant” and ever since we’ve just gone along with it. “Why is it 0.05?” people have critically asked for one hundred years. Unfortunately “arbitrariness”, as a critique, is only effective if you are able to suggest a less arbitrary value, and despite many efforts to change this the convention has remained.

The act of p-value circling is to look at a p-value that’s significant but close to 0.05 and go: “hm, I don’t know about that…” Perhaps you use a red ballpoint pen to circle it on the print journal you subscribe to in the year 2025. If not, you may underline it with some sort of digital pen technology and share it online.

“Hmm… Suspicious…”

What (potentially) justifies p-value circling?

Before we get into it let’s briefly try to remind ourselves what p-values are even supposed to do. (This will be a brief summary; if you want to learn this for real I recommend reading Daniël Lakens’ free online textbook, from which all this borrows heavily.)

As far as I’ve understood, Fisher’s idea about p-values was supplanted by the more rigorous (in terms of statistical philosophy) Neyman-Pearson framework. It is within this framework that we find the familiar type 1 and type 2 error rates. Probability is viewed as being about outcomes in a hypothetical scenario where a procedure is repeated many times. You’re actually supposed to set the \(\alpha\) at a level that’s justifiable based on what null hypothesis you’re testing. As far as I’ve understood, no one has ever done so, except that one time physicists at CERN decided they wanted to be really sure they didn’t incorrectly claim to have found the Higgs boson. Instead everyone just uses the arbitrary convention of \(\alpha = 0.05\).

If you assume that the null is true, the p-value distribution is uniform. Let’s do the exercise of generating a hundred thousand t-tests between two groups, n = 100 per group, where there is no mean difference. Then we’ll look at the p-values.
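The post’s own simulation code and histogram aren’t reproduced here; a minimal Python sketch of the same exercise, with the equal-variance normal groups and the seed as my assumptions, might look like this:

```python
# A minimal sketch: 100,000 two-sample t-tests with no true mean
# difference, n = 100 per group, both groups drawn from N(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2025)
n_sims, n = 100_000, 100

group_a = rng.normal(0.0, 1.0, size=(n_sims, n))
group_b = rng.normal(0.0, 1.0, size=(n_sims, n))
p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue

# Under a true null the p-values are uniform: each bin of width 0.05
# should hold about 5% of the simulations, including (0, 0.05].
counts, _ = np.histogram(p_values, bins=20, range=(0.0, 1.0))
print(counts / n_sims)            # each entry close to 0.05
print((p_values < 0.05).mean())   # long-run type 1 error rate, ~0.05
```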

What about p-hacking? Simulating questionable practices like optional stopping does distort this uniform distribution (a minimal sketch follows below), but the resulting p-values still dance around. So even if you think the practice is common, I don’t think you should put a lot of weight on the idea that whatever questionable statistical jutsu a researcher does to avoid a null result will put their (hacked) p-value just below the threshold.
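For concreteness, here is a hedged sketch of one such simulation: under a true null, peek at the p-value after every ten added observations per group and stop as soon as p < 0.05. The specific stopping rule is my illustrative assumption, not necessarily the one the post simulated:

```python
# Optional stopping under a true null: test after every 10 added
# observations per group and stop as soon as p < 0.05 (or give up
# at n = 100). The stopping rule here is an illustrative assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_max, step = 10_000, 100, 10

final_p = np.empty(n_sims)
for i in range(n_sims):
    a = rng.normal(0.0, 1.0, n_max)
    b = rng.normal(0.0, 1.0, n_max)
    for n in range(step, n_max + 1, step):
        p = stats.ttest_ind(a[:n], b[:n]).pvalue
        if p < 0.05:
            break  # the "desired" result: stop collecting data
    final_p[i] = p

# The false positive rate is inflated well above 5%, but the hacked
# p-values that land below 0.05 still spread across the interval
# rather than all landing just under the threshold.
print((final_p < 0.05).mean())
print(np.histogram(final_p[final_p < 0.05], bins=5, range=(0.0, 0.05))[0])
```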

Basically I don’t think a single p-value in and of itself can carry a lot of information about statistical malpractice. I’m sympathetic to the rule that if you’ve ever scoffed at someone using the term “marginally significant”, you’re not allowed to call something “marginally insignificant”.

Lindley’s paradox

A potentially more sophisticated justification for p-circling is “Lindley’s paradox.”

I think many instinctively feel some resistance to a very strict interpretation of p-values, where their sole function is to be uniform under the null and the thing we care about is whether they clear our pre-specified alpha level. After all, we rightfully get annoyed when p-values are reported only with a less-than sign. And should we really not feel confident that there’s something there when we see a p = 0.001?

In this group-difference setup a smaller p-value implies a larger difference between your sample means, and you’re more likely to come across a large mean difference if you happen to be in a universe where there truly is a difference between the groups.

Let’s see what it looks like when we have a difference between the groups.
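Again the post’s own numbers aren’t reproduced; here is a minimal sketch under assumed values — a standardized mean difference of d = 0.5 with n = 200 per group, which puts the t-test’s power near 99.9%:

```python
# The same exercise with a true effect: d = 0.5, n = 200 per group,
# giving the two-sample t-test roughly 99.9% power at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n, d = 100_000, 200, 0.5

group_a = rng.normal(0.0, 1.0, size=(n_sims, n))
group_b = rng.normal(d, 1.0, size=(n_sims, n))
p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue

# Under this alternative most p-values are tiny ...
print((p_values < 0.001).mean())                        # roughly 0.95
# ... and p-values just under the threshold are rare ...
print(((p_values >= 0.04) & (p_values < 0.05)).mean())  # well below 0.01
# ... whereas under the null the (0.04, 0.05) bin holds exactly 1% of
# p-values. At this power, p = 0.048 is more probable under the null
# than under the alternative: Lindley's paradox in miniature.
```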

Maier and Lakens (2022) suggest you could do this exercise when planning a test in order to justify your choice of alpha-level. However, I doubt this structured approach is what lies behind the casual circling of p-values I’ve come across online over the years. My impression is that most of the p-values circled on social media haven’t come from studies with very high power to detect small effects.

There is a concern that very large studies may pick up on “noise”, or that other violations of model assumptions (e.g. normality) tend to bias p-values downward. I don’t really know what to make of these concerns. I think that might be true for some model violations, while others may hurt power instead. I would assume it’s a complicated empirical question whether the statistical models we use tend to misfit reality more in one direction than the other.

Regardless, I don’t think this can be salvaged as a ground for being skeptical of p-values close to their threshold by appealing to Lindley’s paradox.

For the moment I feel safest treating the conventional threshold as what it is, as arbitrary as that is. I’m of course concerned about QRPs and p-hacking, but I don’t see a reason why a single p-value close to 0.05 would be useful evidence of them.

Some concluding thoughts

As I prefaced, this is complicated stuff and I have probably gotten something wrong. Regarding the larger question of whether p-values can be interpreted as evidence, I currently land on the conclusion “not in and of themselves”: they have to be contextualized in relation to power and other features of the study, as well as the context you come across them in. Lindley’s paradox can be a useful illustration of one of the reasons the interpretation isn’t straightforward (but I don’t think it justifies p-circling).

On the other hand, I know that smarter and more well-read people than I disagree on how straightforward this interpretation is. The textbook we used in my PhD-level course in medical statistics contains a table that tells us to interpret p-values directly as strength of evidence:

hmm..

I don’t think it’s quite that simple. My current understanding (given Lindley’s paradox) is that evidence has to be “relative”, in some sense. P-values only tell one side of the story, and are only made to tell one side of the story. If you want a statistic that expresses the strength of evidence you should probably use something else, e.g. Bayes factors.
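As a hedged illustration of that “relative” point, the sketch below compares the density of an observed t-statistic under the null (central t) with its density under a point alternative (noncentral t); the ratio is a likelihood ratio, the building block of a Bayes factor. A real Bayes factor would average the alternative over a prior on effect size — the fixed d = 0.5 and n = 200 are my assumptions, carried over from the simulation above:

```python
# Relative evidence as a likelihood ratio: density of the observed t
# under H0 (central t) vs. under a point alternative d = 0.5
# (noncentral t). A real Bayes factor would integrate the alternative
# over a prior on effect size; the point alternative is a
# simplification for illustration.
import numpy as np
from scipy import stats

n = 200                      # per group, as in the simulation above
df = 2 * n - 2
ncp = 0.5 * np.sqrt(n / 2)   # noncentrality parameter for d = 0.5

for p in (0.048, 0.001):
    t_obs = stats.t.isf(p / 2, df)        # t-value with this two-sided p
    f0 = stats.t.pdf(t_obs, df)           # density under the null
    f1 = stats.nct.pdf(t_obs, df, ncp)    # density under the alternative
    print(f"p = {p}: t = {t_obs:.2f}, likelihood ratio H1/H0 = {f1 / f0:.3g}")

# At this power, the t behind p = 0.048 is more probable under the
# null (ratio < 1), while p = 0.001 strongly favors the alternative.
```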

I am serious. If you think I’m misunderstanding something badly, or you just want to discuss, or you want to gently point me in the right direction: please tell me what I’m missing. I don’t have a comment section on this blog, but I’ll post this on Bluesky and then update this post to link the post that links this post: Here is the link.

So if you want to comment, do it over there, or send me an e-mail.
