SWE-Bench+: Enhanced Coding Benchmark for LLMs
- URL: http://arxiv.org/abs/2410.06992v2
- Date: Thu, 10 Oct 2024 13:13:09 GMT
- Title: SWE-Bench+: Enhanced Coding Benchmark for LLMs
- Authors: Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, Song Wang
- Abstract summary: The SWE-bench dataset comprises 2,294 real-world GitHub issues and their corresponding pull requests.
The resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%.
The same data quality issues also exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified.
- Score: 7.584728644156347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Jimenez et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench remains missing. In this paper, we address this gap by presenting an empirical analysis of the SWE-bench dataset. We conducted a manual screening of instances where SWE-Agent+GPT-4 successfully resolved issues by comparing the model-generated patches with the actual pull requests. SWE-Agent+GPT-4 was at the top of the SWE-bench leaderboard at the time of our study. Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating, as the solutions were directly provided in the issue report or its comments; we refer to this as the solution leakage problem. 2) 31.08% of the passed patches are suspicious due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified. In addition, over 94% of the issues were created before the LLMs' knowledge cutoff dates, posing potential data leakage issues.
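The filtering and re-scoring step described in the abstract is straightforward to reproduce in spirit. The following is a minimal Python sketch, not the authors' released code: the record fields (resolved, solution_leak, weak_tests, created_at) and the cutoff date are hypothetical stand-ins for the flags one would assign during the kind of manual screening the paper describes.

```python
# Minimal sketch (not the authors' code): recompute an agent's resolution rate
# after excluding instances flagged during manual screening for solution
# leakage or weak tests, and estimate how many issues predate a model's
# knowledge cutoff. All field names and the cutoff date are hypothetical.
from datetime import datetime, timezone

# Example cutoff only; actual knowledge cutoffs differ per model.
KNOWLEDGE_CUTOFF = datetime(2023, 4, 30, tzinfo=timezone.utc)


def resolution_rate(results):
    """Fraction of benchmark instances the agent resolved."""
    return sum(r["resolved"] for r in results) / len(results)


def filter_problematic(results):
    """Drop instances whose success is untrustworthy (leaked solution or weak tests)."""
    return [r for r in results if not (r["solution_leak"] or r["weak_tests"])]


def share_before_cutoff(results, cutoff=KNOWLEDGE_CUTOFF):
    """Share of issues created before the knowledge cutoff (potential data leakage)."""
    created = [datetime.fromisoformat(r["created_at"]) for r in results]
    return sum(ts < cutoff for ts in created) / len(created)


if __name__ == "__main__":
    # Toy records standing in for manually screened SWE-bench instances.
    results = [
        {"resolved": True, "solution_leak": True, "weak_tests": False,
         "created_at": "2022-11-03T00:00:00+00:00"},
        {"resolved": True, "solution_leak": False, "weak_tests": True,
         "created_at": "2021-06-17T00:00:00+00:00"},
        {"resolved": False, "solution_leak": False, "weak_tests": False,
         "created_at": "2023-01-09T00:00:00+00:00"},
        {"resolved": True, "solution_leak": False, "weak_tests": False,
         "created_at": "2020-02-21T00:00:00+00:00"},
    ]
    print(f"raw resolution rate:      {resolution_rate(results):.2%}")
    print(f"filtered resolution rate: {resolution_rate(filter_problematic(results)):.2%}")
    print(f"issues before cutoff:     {share_before_cutoff(results):.2%}")
```

The same pattern would apply to the Lite and Verified variants by swapping in the corresponding result set.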
Related papers
- LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks [15.584759853972992]
Large Language Models (LLMs) are widely utilized in software engineering (SE) tasks, such as code generation and automated program repair.
Their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage.
This paper presents the first large-scale analysis of data leakage in 83 SE benchmarks concerning LLMs.
arXiv Detail & Related papers (2025-02-10T07:33:49Z)
- SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks.
SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues.
We assess our approach on the SWE-bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
- Training Software Engineering Agents and Verifiers with SWE-Gym [89.55822534364727]
SWE-Gym is the first environment for training real-world software engineering (SWE) agents.
SWE-Gym contains 2,438 real-world Python task instances.
arXiv Detail & Related papers (2024-12-30T18:15:39Z)
- MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark [57.999567012489706]
We propose a contamination-free and more challenging benchmark called MMLU-CF.
This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage.
Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set.
arXiv Detail & Related papers (2024-12-19T18:58:04Z)
- A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models [11.087034068992653]
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs.
It is constructed using a dataset curated from 30 well-known GitHub repositories.
We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z)
- Revisiting Multi-Modal LLM Evaluation [29.094387692681337]
We pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones.
Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs.
arXiv Detail & Related papers (2024-08-09T20:55:46Z)
- LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
- SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines [15.389579061898429]
We present SPADE, a method for automatically synthesizing data quality assertions.
In testing across nine different real-world LLM pipelines, SPADE efficiently reduces the number of assertions by 14%.
arXiv Detail & Related papers (2024-01-05T19:27:58Z)
- Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
- Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We show and investigate the capabilities of smaller self-adaptive LMs using only unlabeled test data.
Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z)