SWE-Bench+: Enhanced Coding Benchmark for LLMs
- URL: http://arxiv.org/abs/2410.06992v2
- Date: Thu, 10 Oct 2024 13:13:09 GMT
- Title: SWE-Bench+: Enhanced Coding Benchmark for LLMs
- Authors: Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, Song Wang
- Abstract summary: The SWE-bench dataset comprises 2,294 real-world GitHub issues and their corresponding pull requests.
The resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%.
The same data quality issues also exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified.
- Score: 7.584728644156347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Jimenez et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench remains missing. In this paper, we address this gap by presenting an empirical analysis of the SWE-bench dataset. We conducted a manual screening of instances where SWE-Agent+GPT-4 successfully resolved issues by comparing the model-generated patches with the actual pull requests. SWE-Agent+GPT-4 was at the top of the SWE-bench leaderboard at the time of our study. Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating, as the solutions were directly provided in the issue report or its comments; we refer to this as the solution leakage problem. 2) 31.08% of the passed patches are suspicious due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified. In addition, over 94% of the issues were created before the LLMs' knowledge cutoff dates, posing potential data leakage issues.
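The filtering and re-scoring step described in the abstract is straightforward to reproduce in spirit. The following is a minimal Python sketch, not the authors' released code: the record fields (resolved, solution_leak, weak_tests, created_at) and the cutoff date are hypothetical stand-ins for the flags one would assign during the kind of manual screening the paper describes.

```python
# Minimal sketch (not the authors' code): recompute an agent's resolution rate
# after excluding instances flagged during manual screening for solution
# leakage or weak tests, and estimate how many issues predate a model's
# knowledge cutoff. All field names and the cutoff date are hypothetical.
from datetime import datetime, timezone

# Example cutoff only; actual knowledge cutoffs differ per model.
KNOWLEDGE_CUTOFF = datetime(2023, 4, 30, tzinfo=timezone.utc)


def resolution_rate(results):
    """Fraction of benchmark instances the agent resolved."""
    return sum(r["resolved"] for r in results) / len(results)


def filter_problematic(results):
    """Drop instances whose success is untrustworthy (leaked solution or weak tests)."""
    return [r for r in results if not (r["solution_leak"] or r["weak_tests"])]


def share_before_cutoff(results, cutoff=KNOWLEDGE_CUTOFF):
    """Share of issues created before the knowledge cutoff (potential data leakage)."""
    created = [datetime.fromisoformat(r["created_at"]) for r in results]
    return sum(ts < cutoff for ts in created) / len(created)


if __name__ == "__main__":
    # Toy records standing in for manually screened SWE-bench instances.
    results = [
        {"resolved": True, "solution_leak": True, "weak_tests": False,
         "created_at": "2022-11-03T00:00:00+00:00"},
        {"resolved": True, "solution_leak": False, "weak_tests": True,
         "created_at": "2021-06-17T00:00:00+00:00"},
        {"resolved": False, "solution_leak": False, "weak_tests": False,
         "created_at": "2023-01-09T00:00:00+00:00"},
        {"resolved": True, "solution_leak": False, "weak_tests": False,
         "created_at": "2020-02-21T00:00:00+00:00"},
    ]
    print(f"raw resolution rate:      {resolution_rate(results):.2%}")
    print(f"filtered resolution rate: {resolution_rate(filter_problematic(results)):.2%}")
    print(f"issues before cutoff:     {share_before_cutoff(results):.2%}")
```

The same pattern would apply to the Lite and Verified variants by swapping in the corresponding result set.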
Related papers
- LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks [15.584759853972992]
Large Language Models (LLMs) are widely utilized in software engineering (SE) tasks, such as code generation and automated program repair.
Their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage.
This paper presents the first large-scale analysis of data leakage in 83 SE benchmarks concerning LLMs.
arXiv Detail & Related papers (2025-02-10T07:33:49Z)
- SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks.
SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues.
We assess our approach on the SWE-bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
- Training Software Engineering Agents and Verifiers with SWE-Gym [89.55822534364727]
SWE-Gym is the first environment for training real-world software engineering (SWE) agents.
SWE-Gym contains 2,438 real-world Python task instances.
arXiv Detail & Related papers (2024-12-30T18:15:39Z)
- MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark [57.999567012489706]
We propose a contamination-free and more challenging benchmark called MMLU-CF.
This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage.
Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set.
arXiv Detail & Related papers (2024-12-19T18:58:04Z)
- A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models [11.087034068992653]
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs.
It is constructed using a dataset curated from 30 well-known GitHub repositories.
We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z)
- Revisiting Multi-Modal LLM Evaluation [29.094387692681337]
We pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones.
Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs.
arXiv Detail & Related papers (2024-08-09T20:55:46Z)
- LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
- SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines [15.389579061898429]
We present SPADE, a method for automatically synthesizing data quality assertions.
In testing across nine different real-world LLM pipelines, SPADE efficiently reduces the number of assertions by 14%.
arXiv Detail & Related papers (2024-01-05T19:27:58Z)
- Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
- Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We show and investigate the capabilities of smaller self-adaptive LMs using only unlabeled test data.
Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z)