GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
- URL: http://arxiv.org/abs/2505.23671v2
- Date: Sat, 31 May 2025 01:53:06 GMT
- Title: GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
- Authors: Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica
- Abstract summary: GSO is a benchmark for evaluating language models' capabilities in developing high-performance software. SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
- Score: 19.46051971038257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories and identify 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and performance test as a precise specification, and tasked with improving runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, practicing lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
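As a rough sketch of how such a speedup-based evaluation can be set up (the timing harness, function names, and scoring rule here are illustrative assumptions, not GSO's actual evaluation code): time the performance test on the baseline, the agent's patch, and the expert commit, then compare the two speedups.

```python
import statistics
import subprocess
import time

def time_perf_test(repo_dir: str, test_cmd: list[str], repeats: int = 5) -> float:
    """Run the performance test several times and return the median wall-clock runtime."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(test_cmd, cwd=repo_dir, check=True, capture_output=True)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def relative_speedup(baseline_s: float, patched_s: float, expert_s: float) -> float:
    """Agent speedup relative to the expert developer's speedup over the same baseline."""
    return (baseline_s / patched_s) / (baseline_s / expert_s)

# Hypothetical numbers: baseline 12.0s, agent patch 9.0s, expert commit 4.0s
# -> the agent achieves (12/9) / (12/4), about 0.44 of the expert speedup.
print(relative_speedup(12.0, 9.0, 4.0))
```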
Related papers
- HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization [31.908590128913094]
HeuriGym is an agentic framework designed for evaluating algorithms generated by Large Language Models (LLMs). We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
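A minimal sketch of the generate-then-execute evaluation loop such an agentic benchmark implies; the types, the objective function, and the zero-score-on-crash rule are assumptions rather than HeuriGym's actual interface.

```python
from typing import Callable, Sequence

# A candidate heuristic maps a problem instance to a solution; it is scored by
# executing it on a set of instances and averaging an objective value.
Heuristic = Callable[[Sequence[int]], list[int]]
Objective = Callable[[Sequence[int], list[int]], float]

def evaluate_heuristic(heuristic: Heuristic,
                       instances: list[Sequence[int]],
                       objective: Objective) -> float:
    """Run an LLM-generated heuristic on each instance and average its objective value."""
    scores = []
    for inst in instances:
        try:
            solution = heuristic(inst)      # execute the generated code
            scores.append(objective(inst, solution))
        except Exception:                   # a crashing heuristic scores zero
            scores.append(0.0)
    return sum(scores) / len(scores)
```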
arXiv Detail & Related papers (2025-06-09T17:46:47Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
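For intuition, external test-time scaling is often implemented as best-of-N sampling with some scoring function; the sketch below is that generic pattern under assumed `generate` and `score` callables, not this paper's specific internal/external TTC recipe.

```python
from typing import Callable

def best_of_n(generate: Callable[[str], str],
              score: Callable[[str, str], float],
              issue: str,
              n: int = 8) -> str:
    """External test-time compute scaling: sample n candidate patches for an issue
    and keep the one the scoring function rates highest."""
    candidates = [generate(issue) for _ in range(n)]
    return max(candidates, key=lambda patch: score(issue, patch))

# `generate` would call the underlying model; `score` could be a verifier,
# a reward model, or a regression-test check (all assumptions here).
```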
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
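A compact sketch of the re-sampling idea: at states judged to be key decision points, branch by sampling several alternative actions instead of committing to one. The node structure, branch factor, and decision-point test below are assumptions, not DARS's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TrajectoryNode:
    """One agent state in a search tree; children are alternative continuations."""
    state: str
    children: list["TrajectoryNode"] = field(default_factory=list)

def expand_with_resampling(node: TrajectoryNode,
                           propose_action: Callable[[str], str],
                           apply_action: Callable[[str, str], str],
                           is_decision_point: Callable[[str], bool],
                           branch_factor: int = 3) -> None:
    """Re-sample several alternative actions at decision points; otherwise take one."""
    k = branch_factor if is_decision_point(node.state) else 1
    for _ in range(k):
        action = propose_action(node.state)               # e.g., another LLM sample
        node.children.append(TrajectoryNode(apply_action(node.state, action)))
```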
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [89.97082652805904]
We propose QLASS (Q-guided Language Agent Stepwise Search) to automatically generate annotations by estimating Q-values. With this stepwise guidance, we propose a Q-guided generation strategy that enables language agents to better adapt to long-term value. We empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis.
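One step of Q-guided stepwise search can be sketched as scoring each candidate action with an estimated Q-value and taking the best; the callables below are assumed stand-ins, not QLASS's trained estimator.

```python
from typing import Callable

def q_guided_step(state: str,
                  propose_actions: Callable[[str], list[str]],
                  q_value: Callable[[str, str], float]) -> str:
    """Score candidate actions with an estimated Q-value and greedily pick the best
    (a beam variant would keep the top-k instead)."""
    candidates = propose_actions(state)          # e.g., sampled from the agent LLM
    return max(candidates, key=lambda action: q_value(state, action))
```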
arXiv Detail & Related papers (2025-02-04T18:58:31Z)
- LLM Program Optimization via Retrieval Augmented Search [71.40092732256252]
We propose a blackbox adaptation method called Retrieval Augmented Search (RAS) that performs beam search over candidate optimizations. We show that RAS performs 1.8× better than prior state-of-the-art blackbox adaptation strategies. We also propose a method called AEGIS for improving interpretability by decomposing training examples into "atomic edits".
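A generic sketch of beam search over candidate optimizations, where each proposal is conditioned on retrieved examples of similar programs; the retrieval, proposal, and runtime callables are assumptions, not RAS's actual interface.

```python
from typing import Callable

def retrieval_augmented_beam_search(program: str,
                                    retrieve: Callable[[str], list[str]],
                                    propose: Callable[[str, list[str]], list[str]],
                                    runtime: Callable[[str], float],
                                    beam_width: int = 4,
                                    rounds: int = 3) -> str:
    """Keep a beam of the fastest candidate programs, expanding each one with
    LLM-proposed rewrites that are conditioned on retrieved similar examples."""
    beam = [program]
    for _ in range(rounds):
        candidates = list(beam)
        for prog in beam:
            examples = retrieve(prog)                        # similar programs / known edits
            candidates.extend(propose(prog, examples))       # LLM-proposed optimizations
        beam = sorted(candidates, key=runtime)[:beam_width]  # keep the fastest
    return beam[0]
```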
arXiv Detail & Related papers (2025-01-31T06:34:47Z)
- SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement [18.84439000902905]
Current large language model (LLM)-based software agents often follow linear, sequential processes. We propose SWE-Search, a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism. This work highlights the potential of self-evaluation-driven search techniques in complex software engineering environments.
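The selection rule in such a search is typically standard UCT, balancing a node's mean value (here supplied by some self-evaluation score) against how often it has been visited; this is a textbook sketch, not SWE-Search's exact variant.

```python
import math
from dataclasses import dataclass, field

@dataclass
class MCTSNode:
    """A search-tree node over agent states; value_sum and visits are updated
    when evaluations are backpropagated up the tree."""
    state: str
    children: list["MCTSNode"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

def uct_select(parent: MCTSNode, c: float = 1.4) -> MCTSNode:
    """Pick the child with the best exploitation/exploration trade-off (UCT)."""
    def uct(child: MCTSNode) -> float:
        if child.visits == 0:
            return float("inf")                 # always try unvisited children first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        return exploit + explore
    return max(parent.children, key=uct)
```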
arXiv Detail & Related papers (2024-10-26T22:45:56Z)
- REDO: Execution-Free Runtime Error Detection for COding Agents [3.9903610503301072]
Execution-free Runtime Error Detection for COding Agents (REDO) is a method that integrates runtime error detection with static analysis tools.
We demonstrate that REDO outperforms current state-of-the-art methods by achieving an 11.0% higher accuracy and a 9.1% higher weighted F1 score.
arXiv Detail & Related papers (2024-10-10T18:06:29Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 competitive C++ programming submission pairs with performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
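As a small illustration of retrieval-based few-shot prompting for optimization (the retrieval function and prompt layout are assumptions, not the paper's exact templates): retrieve a few (slower, faster) edit pairs similar to the input program and prepend them before asking for an optimized rewrite.

```python
from typing import Callable

def build_optimization_prompt(slow_program: str,
                              retrieve_pairs: Callable[[str], list[tuple[str, str]]],
                              k: int = 2) -> str:
    """Prepend k retrieved (slower, faster) program pairs as few-shot examples,
    then ask for an optimized version of the new program."""
    parts = []
    for slow, fast in retrieve_pairs(slow_program)[:k]:
        parts.append(f"Slower version:\n{slow}\nOptimized version:\n{fast}\n")
    parts.append(f"Slower version:\n{slow_program}\nOptimized version:\n")
    return "\n".join(parts)
```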
arXiv Detail & Related papers (2023-02-15T18:59:21Z)