2%的ICML论文因作者在评审中使用LLM而被直接拒稿。
2% of ICML papers desk rejected because the authors used LLM in their reviews

原始链接: https://blog.icml.cc/2026/03/18/on-violations-of-llm-review-policies/

## ICML 2026 与人工智能在同行评审中的应用:维护诚信 ICML 2026会议正在积极应对人工智能对同行评审诚信带来的挑战。 认识到潜在的滥用风险,ICML实施了两级政策:**政策A**(禁止使用LLM)和**政策B**(允许使用LLM理解和润色评论)。 评审员选择他们偏好的政策,没有人会被强制执行政策A,如果他们更喜欢政策B。 尽管有明确的协议,ICML检测到在分配到政策A的评审员提交的795份(约1%)评论中使用了LLM。 这是通过一种新颖的水印技术实现的——在论文PDF中嵌入隐藏指令,要求LLM在评论中包含特定短语。 所有标记的实例都经过手动验证,以避免误报。 因此,与违规评审员相关的497篇投稿被直接拒稿,51名评审员被从评审池中移除。 ICML强调这并非是对评论*质量*的判断,而是对信任被破坏的回应。 虽然水印方法并非万无一失,但它成功地识别了大量政策违规行为。 ICML承认对评审过程的干扰,并正在支持受影响的领域主席和作者。 这一坚定立场旨在维护信任的基础,这对于一个功能正常的同行评审系统至关重要,因为人工智能的整合正在不断发展。

## ICML论文审查丑闻:LLM使用与学术诚信 最近,国际机器学习会议(ICML)的一项调查显示,2%的论文因其评审员违反了禁止使用大型语言模型(LLM)的政策而被直接拒稿。评审员可以选择加入“禁止LLM”政策,但发现那些加入政策的评审员使用了LLM来生成他们的审查意见。 ICML采用了一种巧妙的检测方法:在PDF中嵌入隐藏指令,提示LLM插入特定短语。包含这两个短语的审查意见会被标记出来。这并非关于检测LLM用于编辑的通用使用情况,而是专门识别那些复制粘贴LLM生成内容作为自己原创内容的人。 这一事件引发了关于学术诚信、对LLM日益增长的依赖以及研究界内部压力的争论。一些人主张严厉惩罚,而另一些人则认为应该关注调整审查流程,以有效地适应LLM。这种情况凸显了随着人工智能工具的普及,在同行评审中维持信任和质量的挑战。许多人认为这仅仅是个开始,目前的检测方法长期来看将不再有效。
相关文章

原文

By ICML 2026 Program Chairs Alekh Agarwal, Miroslav Dudik, Sharon Li, Martin Jaggi, Scientific Integrity Chair Nihar B. Shah, and Communications Chairs Katherine Gorman and Gautam Kamath.

AI has increasingly become a valuable part of researchers’ workflows. Unfortunately, AI has the potential to hurt the integrity of peer review if improperly used. Conferences must adapt, creating rules and policies to handle the new normal, and taking disciplinary action against those who break the rules and violate the trust that we all place in the review process. 

ICML is actively working to adapt. This year, we desk-rejected 497 papers (~2% of all submissions), corresponding to submissions of the 506 reciprocal reviewers who violated the rules regarding LLM usage that they had previously explicitly agreed to. 

ICML 2026 has two policies regarding LLM use in reviewing:

  • Policy A (Conservative): No LLM use allowed.
  • Policy B (Permissive): LLMs allowed to help understand the paper and related works, and polish reviews.

This two-policy framework was formed based on community preferences and feedback — indeed, the community is divided on the best way to use LLMs in peer review, with issues such as author consent colliding with preferred reviewer workflows. Further details on the policy are available here

After a selection process, in which reviewers got to choose which policy they would like to operate under, they were assigned to either Policy A or Policy B. In the end, based on author demands and reviewer signups, the only reviewers who were assigned to Policy A (no LLMs) were those who explicitly selected “Policy A” or “I am okay with either [Policy] A or B.” To be clear, no reviewer who strongly preferred Policy B was assigned to Policy A. 

795 reviews (~1% of all reviews) written by 506 unique reviewers who were assigned Policy A (no LLMs) were detected to have used LLMs in their review. Again, recall that these are reviewers who explicitly agreed to not use LLMs in their reviews. The method used is described below, and generic AI-text detectors were not used. Every flagged instance was manually verified by a human, in order to avoid false positives. 

If the designated Reciprocal Reviewer for a submission produced such a review, their submission was rejected. In total, this resulted in 497 rejections. All Policy A (no LLMs) reviews that were detected to be LLM generated were removed from the system. If more than half of the reviews submitted by a Policy A reviewer were detected to be LLM generated, then all of their reviews were deleted, and the reviewer themselves was removed from the reviewer pool. 51 Policy A reviewers were detected to have used LLMs in more than half of their reviews, which is about 10% of the total of 506 detected reviewers. 

To be clear, we are not making a judgment call about the quality of flagged reviews or the reviewers’ intentions. This is simply a statement that the reviewer used an LLM at some point when composing the review, which is unfortunately a violation of the policy they agreed to abide by.

We regret the disruption this will cause in the peer review process. We have been in direct  communication with SACs and ACs impacted, and offered support where we can. The reviews that violated policy have been removed, and ACs may need to find new reviewers. Some submissions which already received a full set of reviews have been desk rejected. And some reviewers whose submissions have been desk rejected may become unresponsive. 

At a high level, the LLM detection involved watermarking submission PDFs with hidden LLM instructions, which would subtly influence any review produced via an LLM. Note that this is not a difficult measure to circumvent, particularly if it is known publicly (which was the case for almost the entire review period). Indeed, it may only catch some of the most egregious and careless uses of LLMs in reviewing, where the reviewer is inputting the PDF to the LLM and then directly copy-pasting the output from the LLM. Action was only taken for reviews that were explicitly written by reviewers who agreed to not use LLMs (Policy A). Despite all these caveats, 795 reviews (~1% of all reviews) were found to be violations of the policy. 

The method used was based on recent work by Rao, Kumar, Lakkaraju, and Shah. First, we created a dictionary of 170,000 phrases. For each paper, we sampled two phrases randomly from this dictionary. The probability with which a given pair of phrases is picked is thus smaller than one in ten billion. We watermarked the PDF of each paper submitted with instructions, visible only to an LLM, instructing it to include the two selected phrases in the review. (A human reading the PDF would not directly see this watermark.) 

This technique will not always be successful in detecting LLM generated reviews. The reviewer could discover the watermark and remove it or work around it. The review text could be modified. The LLM may simply ignore the hidden instructions. In experiments shortly before the submission deadline, frontier LLMs generally (though not always) followed the injected instructions. Success rates were over 80% for most models, potentially depending on the method chosen by some LLMs to read the PDF. 

Reviewers may be understandably concerned that their reviews could be wrongly flagged. As described in the paper that outlines the approach we used, false positive rates are strongly controlled. Every flagged instance was also manually inspected by a human to ensure that the review wasn’t simply mentioning that the watermark was present. 

Reviewers can fall short of our expectations in many ways, with or without AI involved. This initiative focused only on one particular action (breaking previously agreed-upon rules for LLM usage) and still identified it for ~1% of all reviews. 

We hope that by taking strong action against violations of agreed-upon policy we will remind the community that as our field changes rapidly the thing we must protect most actively is our trust in each other. If we cannot adapt our systems in a setting based in trust, we will find that they soon become outdated and meaningless. 

联系我们 contact @ memedata.com