PerfBench: Can Agents Resolve Real-World Performance Bugs?
- URL: http://arxiv.org/abs/2509.24091v2
- Date: Thu, 16 Oct 2025 17:31:16 GMT
- Title: PerfBench: Can Agents Resolve Real-World Performance Bugs?
- Authors: Spandan Garg, Roshanak Zilouchian Moghaddam, Neel Sundaresan
- Abstract summary: PerfBench is a benchmark comprising 81 real-world performance bug-fixing tasks from GitHub. PerfBench features a novel evaluation harness that allows agents to generate their own performance benchmarks. We develop OpenHands-Perf-Agent, which incorporates performance-aware tooling and instructions and achieves a ~20% success rate on the benchmark.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown promise in automated bug fixing, existing benchmarks primarily focus on functional correctness and fail to evaluate agents' abilities to identify and resolve non-functional issues like performance bugs. We introduce PerfBench, a benchmark comprising 81 real-world performance bug-fixing tasks from popular .NET repositories on GitHub. Unlike existing benchmarks that rely on pre-existing test suites, PerfBench features a novel evaluation harness that allows agents to generate their own performance benchmarks and validates fixes by comparing execution metrics collected for the developer fix and the agent fix. Each task in PerfBench is derived from actual developer fixes linked to performance-related issues, which are then verified by human experts, ensuring real-world relevance. Our evaluation reveals that current state-of-the-art coding agents struggle with performance optimization tasks, with the baseline OpenHands agent achieving only a ~3% success rate on our benchmark. We develop OpenHands-Perf-Agent, which incorporates performance-aware tooling and instructions and achieves a ~20% success rate on the benchmark. We show that by ensuring the agent has proper instructions to benchmark its changes and tooling for benchmark output processing, we can improve the agent's performance significantly, but room for improvement still remains. PerfBench provides a challenging test set for furthering the capabilities of agents in fixing performance issues.
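The abstract describes validating a fix by comparing execution metrics for the developer fix against the agent fix. A minimal sketch of that comparison logic, in Python rather than the paper's .NET setting; the function names, the median-of-N measurement, and the 0.8 speedup-ratio threshold are illustrative assumptions, not the paper's actual harness:

```python
import statistics
import time

def measure(fn, repeats=5):
    """Median wall-clock time of fn() over several runs (median resists noise)."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

def fix_succeeds(baseline_fn, agent_fn, developer_fn, ratio=0.8):
    """An agent fix passes if it recovers at least `ratio` of the
    developer fix's speedup over the unpatched baseline."""
    t_base = measure(baseline_fn)
    dev_speedup = t_base / measure(developer_fn)
    agent_speedup = t_base / measure(agent_fn)
    return agent_speedup >= ratio * dev_speedup
```

Comparing relative speedups rather than absolute times makes the check tolerant of machine-to-machine variation, which matters when the agent generates its own benchmark workloads.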
Related papers
- ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads? [0.8749675983608171]
We introduce ISO-Bench, a benchmark for coding agents to test their capabilities on real-world inference tasks. We curated 54 tasks from merged pull requests with measurable performance improvements.
arXiv Detail & Related papers (2026-02-23T08:37:53Z)
- FeatureBench: Benchmarking Agentic Coding for Complex Feature Development [42.26354337364403]
FeatureBench is a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. It incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. Empirical evaluation reveals that the state-of-the-art agentic model Claude 4.5 Opus achieves a 74.4% resolved rate on SWE-bench.
arXiv Detail & Related papers (2026-02-11T16:06:32Z)
- Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs. kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z) - PerfGuard: A Performance-Aware Agent for Visual Content Generation [53.591105729011595]
PerfGuard is a performance-aware agent framework for visual content generation. It integrates tool performance boundaries into task planning and scheduling. It has advantages in tool selection accuracy, execution reliability, and alignment with user intent.
arXiv Detail & Related papers (2026-01-30T05:12:19Z) - Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments inspired by real-world problems. We show that frontier models and agents score less than 65% on the benchmark. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z) - SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads? [22.075705411944895]
SWE-fficiency is a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories.
arXiv Detail & Related papers (2025-11-08T17:55:09Z) - Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents [71.85020581835042]
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup-planning.
arXiv Detail & Related papers (2025-10-29T16:59:07Z) - From Benchmark Data To Applicable Program Repair: An Experience Report [1.6913109767046948]
This paper describes our approach to automated program repair. We combine various techniques from the literature to achieve this. Experiments show that our approach performs better than other techniques on standard benchmarks. On closer inspection, none of these techniques work on realistic defects that we see in industry.
arXiv Detail & Related papers (2025-08-22T03:59:27Z) - Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z) - The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [1.6249398255272318]
We present empirical evidence that performance gains on SWE-Bench-Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks.
arXiv Detail & Related papers (2025-06-14T00:25:26Z) - SolBench: A Dataset and Benchmark for Evaluating Functional Correctness in Solidity Code Completion and Repair [51.0686873716938]
We introduce SolBench, a benchmark for evaluating the functional correctness of Solidity smart contracts generated by code completion models. We propose a Retrieval-Augmented Code Repair framework to verify functional correctness of smart contracts. Results show that code repair and retrieval techniques effectively enhance the correctness of smart contract completion while reducing computational costs.
arXiv Detail & Related papers (2025-03-03T01:55:20Z) - PACE: A Program Analysis Framework for Continuous Performance Prediction [0.0]
PACE is a program analysis framework that provides continuous feedback on the performance impact of pending code updates.
We design performance microbenchmarks by mapping the execution time of functional test cases given a code update.
Our experiments show strong accuracy in predicting code performance, outperforming the current state-of-the-art by 75% on neural-represented code stylometry features.
arXiv Detail & Related papers (2023-12-01T20:43:34Z) - DeepPERF: A Deep Learning-Based Approach For Improving Software
Performance [8.251500418379942]
We present DeepPERF, a transformer-based approach to suggest performance improvements for C# applications.
Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in 53% of the cases.
We evaluate DeepPERF on 50 open source C# repositories on GitHub.
arXiv Detail & Related papers (2022-06-27T20:35:52Z)
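Several of the listed systems (PACE, DeepPERF, SWE-fficiency) rest on the same primitive: timing test cases or workloads across two code revisions and flagging the ones that got slower. A minimal sketch of that idea in Python; the function names, best-of-N timing, and the 1.25x regression threshold are illustrative assumptions rather than any one paper's method:

```python
import time

def time_test_case(test_fn, repeats=3):
    """Best-of-N wall-clock time for one functional test case
    (best-of-N discards runs inflated by background noise)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        test_fn()
        best = min(best, time.perf_counter() - start)
    return best

def flag_regressions(before, after, threshold=1.25):
    """Given {test_name: seconds} mappings for two revisions, return
    the names whose runtime grew by more than `threshold`x."""
    return [name for name, t_old in before.items()
            if after.get(name, t_old) > threshold * t_old]
```

For example, `flag_regressions({"parse": 1.0, "render": 2.0}, {"parse": 1.1, "render": 3.0})` flags only `"render"`, since its runtime grew by 1.5x while `"parse"` stayed under the threshold.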
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.