By ICML 2026 Program Chairs Alekh Agarwal, Miroslav Dudik, Sharon Li, Martin Jaggi, Scientific Integrity Chair Nihar B. Shah, and Communications Chairs Katherine Gorman and Gautam Kamath.
AI has increasingly become a valuable part of researchers’ workflows. Unfortunately, AI has the potential to hurt the integrity of peer review if improperly used. Conferences must adapt, creating rules and policies to handle the new normal, and taking disciplinary action against those who break the rules and violate the trust that we all place in the review process.
ICML is actively working to adapt. This year, we desk-rejected 497 papers (~2% of all submissions), corresponding to submissions of the 506 reciprocal reviewers who violated the rules regarding LLM usage that they had previously explicitly agreed to.
ICML 2026 has two policies regarding LLM use in reviewing:
- Policy A (Conservative): No LLM use allowed.
- Policy B (Permissive): LLMs allowed to help understand the paper and related works, and polish reviews.
This two-policy framework was formed based on community preferences and feedback — indeed, the community is divided on the best way to use LLMs in peer review, with issues such as author consent colliding with preferred reviewer workflows. Further details on the policy are available here.
Reviewers were asked to choose which policy they would like to operate under, and were then assigned to either Policy A or Policy B. In the end, based on author demands and reviewer signups, the only reviewers assigned to Policy A (no LLMs) were those who explicitly selected “Policy A” or “I am okay with either [Policy] A or B.” To be clear, no reviewer who strongly preferred Policy B was assigned to Policy A.
795 reviews (~1% of all reviews), written by 506 unique reviewers assigned to Policy A (no LLMs), were detected as having been produced with LLM assistance. Again, recall that these are reviewers who explicitly agreed to not use LLMs in their reviews. The method used is described below; generic AI-text detectors were not used. Every flagged instance was manually verified by a human, in order to avoid false positives.
If the designated Reciprocal Reviewer for a submission produced such a review, their submission was rejected. In total, this resulted in 497 rejections. All Policy A (no LLMs) reviews that were detected to be LLM generated were removed from the system. If more than half of the reviews submitted by a Policy A reviewer were detected to be LLM generated, then all of their reviews were deleted, and the reviewer themselves was removed from the reviewer pool. 51 Policy A reviewers were detected to have used LLMs in more than half of their reviews, which is about 10% of the total of 506 detected reviewers.
To be clear, we are not making a judgment call about the quality of flagged reviews or the reviewers’ intentions. This is simply a statement that the reviewer used an LLM at some point when composing the review, which is unfortunately a violation of the policy they agreed to abide by.
We regret the disruption this will cause in the peer review process. We have been in direct communication with SACs and ACs impacted, and offered support where we can. The reviews that violated policy have been removed, and ACs may need to find new reviewers. Some submissions which already received a full set of reviews have been desk rejected. And some reviewers whose submissions have been desk rejected may become unresponsive.
At a high level, the LLM detection involved watermarking submission PDFs with hidden LLM instructions, which would subtly influence any review produced via an LLM. Note that this is not a difficult measure to circumvent, particularly if it is known publicly (which was the case for almost the entire review period). Indeed, it may only catch some of the most egregious and careless uses of LLMs in reviewing, where the reviewer is inputting the PDF to the LLM and then directly copy-pasting the output from the LLM. Action was only taken for reviews that were explicitly written by reviewers who agreed to not use LLMs (Policy A). Despite all these caveats, 795 reviews (~1% of all reviews) were found to be violations of the policy.
The method used was based on recent work by Rao, Kumar, Lakkaraju, and Shah. First, we created a dictionary of 170,000 phrases. For each paper, we sampled two phrases randomly from this dictionary. With 170,000 phrases there are roughly 14.4 billion unordered pairs, so the probability with which a given pair of phrases is picked is smaller than one in ten billion. We watermarked each submitted paper's PDF with instructions, visible only to an LLM, instructing it to include the two selected phrases in the review. (A human reading the PDF would not directly see this watermark.)
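The sampling and detection logic described above can be sketched in a few lines of Python. This is an illustrative toy, not the actual ICML implementation: the phrase strings, function names, and matching rule (requiring both planted phrases to appear verbatim) are assumptions for demonstration.

```python
import math
import random

# Size of the phrase dictionary described in the post.
DICTIONARY_SIZE = 170_000

def pair_probability(n: int = DICTIONARY_SIZE) -> float:
    """Chance that one specific unordered pair is the pair sampled for a paper."""
    return 1 / math.comb(n, 2)  # math.comb(170_000, 2) ~ 1.44e10 pairs

def sample_watermark(dictionary: list[str], rng: random.Random) -> tuple[str, str]:
    """Pick two distinct phrases to embed as hidden LLM instructions in a PDF."""
    a, b = rng.sample(dictionary, 2)
    return a, b

def review_is_flagged(review_text: str, watermark: tuple[str, str]) -> bool:
    """Flag a review only if BOTH planted phrases appear in it."""
    text = review_text.lower()
    return all(phrase.lower() in text for phrase in watermark)
```

Requiring both phrases to co-occur is what drives the false-positive rate so low: even if one phrase appeared in a review by coincidence, the probability of the exact sampled pair appearing together is below one in ten billion.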
This technique will not always succeed in detecting LLM-generated reviews. The reviewer could discover the watermark and remove it or work around it. The review text could be modified. The LLM may simply ignore the hidden instructions. In experiments shortly before the submission deadline, frontier LLMs generally (though not always) followed the injected instructions. Success rates were over 80% for most models, though they varied depending on how the LLM ingested the PDF.
Reviewers may be understandably concerned that their reviews could be wrongly flagged. As described in the paper that outlines the approach we used, false positive rates are strongly controlled. Every flagged instance was also manually inspected by a human to ensure that the review wasn’t simply mentioning that the watermark was present.
Reviewers can fall short of our expectations in many ways, with or without AI involved. This initiative focused only on one particular action (breaking previously agreed-upon rules for LLM usage) and still identified it for ~1% of all reviews.
We hope that by taking strong action against violations of agreed-upon policy we will remind the community that, as our field changes rapidly, the thing we must protect most actively is our trust in each other. If we cannot adapt our systems in a setting based in trust, we will find that they soon become outdated and meaningless.