SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
- URL: http://arxiv.org/abs/2507.11059v2
- Date: Thu, 17 Jul 2025 14:04:07 GMT
- Title: SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
- Authors: Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh
- Abstract summary: The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks. Recent studies have uncovered severe data contamination issues, e.g., SWE-bench reports that 32.67% of successful patches involve direct solution leakage. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges.
- Score: 3.3037205426689433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues: e.g., SWE-bench reports that 32.67% of successful patches involve direct solution leakage and 31.08% pass only due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures task quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks, with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power across state-of-the-art models. We report performance for a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.
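The collection pipeline itself is not reproduced in this abstract. As a rough illustration of its first stage only, the sketch below fetches recent, closed GitHub issues that are linked to a pull request and applies a simple quality filter; the repository, date cutoff, and filter threshold are illustrative assumptions, not the authors' actual criteria.

```python
import requests  # third-party HTTP client

GITHUB_SEARCH = "https://api.github.com/search/issues"

def collect_candidate_issues(repo: str, since: str, token: str | None = None) -> list[dict]:
    """Fetch closed issues in `repo` created after `since` (YYYY-MM-DD) that link to a PR."""
    query = f"repo:{repo} is:issue is:closed linked:pr created:>{since}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(GITHUB_SEARCH, params={"q": query, "per_page": 100}, headers=headers)
    resp.raise_for_status()
    return resp.json()["items"]

def passes_quality_filter(issue: dict, min_body_chars: int = 200) -> bool:
    """Illustrative filter only: require a non-trivial issue description."""
    body = issue.get("body") or ""
    return len(body) >= min_body_chars

if __name__ == "__main__":
    # Hypothetical target repository and cutoff date.
    issues = collect_candidate_issues("psf/requests", since="2024-09-01")
    tasks = [i for i in issues if passes_quality_filter(i)]
    print(f"kept {len(tasks)} of {len(issues)} candidate issues")
```

In a continuously updated benchmark, a job like this would run on a schedule so that newly collected issues postdate every evaluated model's training cutoff, which is what limits contamination.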
Related papers
- LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run (a minimal sampling sketch follows this entry).
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
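The LLMEval-3 abstract does not spell out its sampling procedure; the sketch below shows one way to draw a previously unseen test set for each evaluation run. The question bank, set size, and per-model bookkeeping are assumptions made for illustration.

```python
import random

def sample_unseen_set(question_bank: list[str],
                      already_served: set[str],
                      k: int = 1000,
                      seed: int | None = None) -> list[str]:
    """Draw k questions that the model under test has never been served before."""
    rng = random.Random(seed)
    unseen = [q for q in question_bank if q not in already_served]
    if len(unseen) < k:
        raise ValueError("question bank exhausted; add new items or reduce k")
    sampled = rng.sample(unseen, k)
    already_served.update(sampled)  # remember what this model has now seen
    return sampled

# Usage: keep one `served` set per evaluated model, persisted across runs.
bank = [f"q{i}" for i in range(220_000)]  # stand-in for the proprietary question bank
served: set[str] = set()
run_1 = sample_unseen_set(bank, served, k=1000, seed=1)
run_2 = sample_unseen_set(bank, served, k=1000, seed=2)
assert not set(run_1) & set(run_2)  # consecutive runs never reuse a question
```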
- Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs [19.766885088032932]
Software engineering (SWE) has emerged as a crucial testbed for next-generation LLM agents. Most existing datasets are limited to only a few thousand GitHub-sourced instances. We propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
arXiv Detail & Related papers (2025-06-24T03:53:36Z)
- SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling [39.53265893083118]
Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use. To address this issue, we present SWE-Dev, an SWE agent built upon open-source LLMs. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models can achieve top performance among all open SWE agents.
arXiv Detail & Related papers (2025-06-09T11:03:16Z)
- SWE-bench Goes Live! [39.295587503671015]
We present SWE-bench-Live, a live-updatable benchmark for large language models (LLMs). Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup.
arXiv Detail & Related papers (2025-05-29T13:09:44Z)
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [34.16732444158405]
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. High-quality training data is scarce, especially data that reflects real-world SWE scenarios. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks.
arXiv Detail & Related papers (2025-05-26T18:01:00Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC (a minimal sketch of the external variant follows this entry). We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
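The entry above distinguishes internal from external test-time compute without giving implementation detail. A common form of the external variant is best-of-n sampling with a verifier, sketched below; `generate_patch` and `verifier_score` are hypothetical callables standing in for the agent's generator and its reward- or test-based verifier.

```python
from typing import Callable

def best_of_n(issue: str,
              generate_patch: Callable[[str], str],
              verifier_score: Callable[[str, str], float],
              n: int = 8) -> str:
    """External test-time compute: spend extra inference on n samples, keep the best one."""
    candidates = [generate_patch(issue) for _ in range(n)]
    # Rank candidates with the verifier (e.g., a reward model or a regression-test runner).
    return max(candidates, key=lambda patch: verifier_score(issue, patch))
```

Internal TTC, by contrast, spends the extra compute inside a single trajectory, for example by allowing longer reasoning before a patch is emitted.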
- Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution [22.03052751722933]
Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads. We have identified the most prevalent errors, such as ModuleNotFoundError and TypeError, and highlighted particularly challenging ones, such as OSError and database-related issues (a minimal error-tallying sketch follows this entry).
arXiv Detail & Related papers (2025-03-16T06:24:51Z)
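The failure analysis above groups agent trajectories by the Python exception raised during issue resolution. The sketch below shows that bookkeeping; the log format and regular expression are assumptions about how such trajectories might be stored.

```python
import re
from collections import Counter

# Matches a Python exception class name at the start of a line,
# as in the final line of a traceback: "ModuleNotFoundError: No module named 'foo'".
EXC_PATTERN = re.compile(r"^(\w+(?:Error|Exception))\b", re.MULTILINE)

def tally_errors(trajectory_logs: list[str]) -> Counter:
    """Count which exception types appear across agent execution logs."""
    counts: Counter = Counter()
    for log in trajectory_logs:
        counts.update(EXC_PATTERN.findall(log))
    return counts

logs = [
    "Traceback (most recent call last):\n  ...\nModuleNotFoundError: No module named 'requests'",
    "TypeError: unsupported operand type(s) for +: 'int' and 'str'",
]
print(tally_errors(logs).most_common())  # [('ModuleNotFoundError', 1), ('TypeError', 1)]
```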
- Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement [62.94719119451089]
The Lingma SWE-GPT series learns from and simulates real-world code submission activities.
Lingma SWE-GPT 72B resolves 30.20% of GitHub issues, marking a significant improvement in automatic issue resolution.
arXiv Detail & Related papers (2024-11-01T14:27:16Z)
- Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends [49.68011233558541]
Large autoregressive generative models have emerged as the cornerstone for achieving the highest performance across several Natural Language Processing tasks.
We introduce Maverick, a pipeline that enables running a state-of-the-art Coreference Resolution system within the constraints of an academic budget.
Maverick achieves state-of-the-art performance on the CoNLL-2012 benchmark, training with up to 0.006x the memory resources and achieving up to 170x faster inference compared to previous state-of-the-art systems.
arXiv Detail & Related papers (2024-07-31T09:58:48Z)
- LiveBench: A Challenging, Contamination-Limited LLM Benchmark [93.57775429120488]
We release LiveBench, the first benchmark that contains frequently updated questions from recent information sources. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B parameters in size. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of existing tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z)
- OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. We introduce OR-Bench, the first large-scale over-refusal benchmark.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.