Some critical issues with the SWE-bench dataset

Original link: https://arxiv.org/abs/2410.06992

Aleithan et al. present an analysis of the SWE-bench dataset, a benchmark used to evaluate the coding capabilities of large language models (LLMs). They uncover significant problems that undermine SWE-bench's reliability. By manually inspecting the instances that SWE-agent + GPT-4 successfully resolved, the authors identify two main flaws: solution leakage, where the correct fix is stated directly in the issue report or its comments (32.67% of successful patches), and suspicious patches that pass only because the test cases are too weak to verify correctness (31.08% of passing patches). Removing these problematic instances drops SWE-agent + GPT-4's reported resolution rate from 12.47% to 3.97%. The analysis further reveals the same data-quality issues in the SWE-bench variants, SWE-bench Lite and SWE-bench Verified. Finally, the authors note that over 94% of the issues predate the LLMs' knowledge cutoff dates, raising concerns about potential data contamination. The study underscores the need for careful scrutiny and refinement of coding benchmarks for LLMs.
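
As a rough illustration of the arithmetic, the sketch below (Python; not the paper's code) filters flagged "successes" and recomputes the resolution rate. The instance IDs and flag counts are hypothetical placeholders, and the paper's exact per-instance accounting, which yields 3.97%, may differ.

```python
# Hedged sketch of the filtering step described above; all IDs and counts
# are hypothetical placeholders, not the paper's actual per-instance labels.

def adjusted_resolution_rate(resolved_ids: set[str],
                             leaked_ids: set[str],
                             weak_test_ids: set[str],
                             total_instances: int) -> float:
    """Resolution rate after discarding 'successes' flagged as solution
    leakage or as passing only because of weak tests."""
    genuine = resolved_ids - (leaked_ids | weak_test_ids)
    return len(genuine) / total_instances

# Roughly 286 of SWE-bench's 2,294 instances resolved (~12.47%);
# ~32.67% of those flagged as leaked, ~31.08% as weakly tested.
resolved = {f"inst-{i}" for i in range(286)}
leaked = {f"inst-{i}" for i in range(93)}
weak = {f"inst-{i}" for i in range(93, 182)}
print(f"{adjusted_resolution_rate(resolved, leaked, weak, 2294):.2%}")
# -> ~4.5% with these placeholder counts; the paper reports 3.97%.
```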

Commenters raise concerns about the validity of the paper's evaluation of AI-generated code patches. They argue that the paper wrongly labels some AI patches as deficient, citing specific Django examples in which the AI's solution looks correct or even superior. They also contend that the paper's claim of "solution leakage" via hints_text is mistaken, because the SWE-bench authors explicitly state that this input is not used. The commenters then share positive experiences with AI tools such as Copilot for tasks like generating boilerplate, completing repetitive code, writing bash scripts, and creating tests. They stress that AI acts as "IntelliSense on steroids," boosting productivity by automating predictable code patterns. While acknowledging its limitations and potential for distraction, they find AI-powered IntelliSense highly valuable in particular coding domains. Finally, the commenters question the benchmarking methodology, especially its handling of solutions already available in the issue and of test validation, and voice concerns about potential bias and a lack of transparency among some benchmark participants.
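
To make the hints_text point concrete, here is a minimal inspection sketch, assuming the princeton-nlp/SWE-bench dataset published on Hugging Face: each instance stores the issue text given to the model (problem_statement) separately from the issue comments (hints_text), which the SWE-bench authors say are not part of the model's input.

```python
# Minimal sketch, assuming the princeton-nlp/SWE-bench dataset on Hugging
# Face; the field names below come from that dataset's published schema.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
inst = ds[0]

print(inst["instance_id"])              # repo + issue identifier
print(inst["problem_statement"][:200])  # issue text the agent is given
print(inst["hints_text"][:200])         # issue comments, kept separate;
                                        # per the SWE-bench authors, not
                                        # fed to the model under evaluation
```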

  • Original paper

    SWE-Bench+: Enhanced Coding Benchmark for LLMs, by Reem Aleithan and 5 other authors

    Abstract: Large Language Models (LLMs) in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench is still missing. In this paper, we address this gap by presenting an empirical analysis of the SWE-bench dataset. We conducted a manual screening of instances where SWE-agent + GPT-4 successfully resolved issues, comparing the model-generated patches with the actual pull requests. SWE-agent + GPT-4 was at the top of the SWE-bench leaderboard at the time of our study. Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating, as the solutions were directly provided in the issue report or the comments; we refer to this as the solution leakage problem. 2) 31.08% of the passed patches are suspicious due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-agent + GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified. In addition, over 94% of the issues were created before the LLMs' knowledge cutoff dates, posing potential data leakage issues.
    From: Song Wang [view email]
    [v1] Wed, 9 Oct 2024 15:38:53 UTC (3,863 KB)
    [v2] Thu, 10 Oct 2024 13:13:09 UTC (1,714 KB)