Training Software Engineering Agents and Verifiers with SWE-Gym
- URL: http://arxiv.org/abs/2412.21139v1
- Date: Mon, 30 Dec 2024 18:15:39 GMT
- Title: Training Software Engineering Agents and Verifiers with SWE-Gym
- Authors: Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang
- Abstract summary: SWE-Gym is the first environment for training real-world software engineering (SWE) agents.
SWE-Gym contains 2,438 real-world Python task instances.
- Score: 89.55822534364727
- Abstract: We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language-model-based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.
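The verifier-based inference-time scaling mentioned in the abstract is, in essence, best-of-n sampling: draw several agent trajectories per task, score each with the trained verifier, and keep the top-scoring one. A minimal sketch of that selection step, assuming hypothetical Agent and Verifier interfaces (not SWE-Gym's actual API):

```python
# Minimal sketch of verifier-guided best-of-n inference-time scaling.
# `Agent` and `Verifier` are illustrative stand-ins, not SWE-Gym's API.
from dataclasses import dataclass

@dataclass
class Trajectory:
    actions: list[str]  # the agent's tool calls and edits
    patch: str          # the final code patch produced for the task

class Agent:
    def sample_trajectory(self, task: str) -> Trajectory:
        """One stochastic rollout of the fine-tuned SWE agent (stub)."""
        raise NotImplementedError

class Verifier:
    def score(self, task: str, traj: Trajectory) -> float:
        """Estimated probability that the trajectory resolves the task (stub)."""
        raise NotImplementedError

def best_of_n(agent: Agent, verifier: Verifier, task: str, n: int = 8) -> Trajectory:
    """Sample n candidate trajectories and keep the verifier's favorite."""
    candidates = [agent.sample_trajectory(task) for _ in range(n)]
    return max(candidates, key=lambda t: verifier.score(task, t))
```

Under this scheme, additional inference compute (a larger n) can be traded for a higher resolve rate.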
Related papers
- SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks.
SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues.
We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
- SWE-Bench+: Enhanced Coding Benchmark for LLMs [7.584728644156347]
The SWE-bench dataset comprises 2,294 real-world GitHub issues and their corresponding pull requests.
After filtering out problematic instances, such as those with solution leakage or weak test cases, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%.
The same data quality issues exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified.
arXiv Detail & Related papers (2024-10-09T15:38:53Z)
- AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing.
We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack.
We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous one and a semi-autonomous one that supports human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z) - Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents [106.87436596397816]
Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems.
We propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise.
Experiments show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin.
arXiv Detail & Related papers (2024-08-13T17:50:28Z) - AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation for building generally-capable agents. We take a first step towards building such LLM-based agents with the ability to self-evolve.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that enables LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both, with pass@1 rates of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z)
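The ACI idea summarized above (structured file-viewing, search, edit, and test commands in place of a raw shell) can be pictured with a toy interface. A minimal sketch, with all command names and signatures invented for illustration rather than taken from SWE-agent:

```python
# Toy agent-computer interface (ACI): a small set of structured commands an
# LM agent can issue instead of raw shell access. Names are illustrative only.
import subprocess
from pathlib import Path

class MiniACI:
    def open_file(self, path: str, start: int = 1, window: int = 100) -> str:
        """Show a bounded window of a file so the agent's context stays small."""
        lines = Path(path).read_text().splitlines()
        chunk = lines[start - 1 : start - 1 + window]
        return "\n".join(f"{start + i}: {l}" for i, l in enumerate(chunk))

    def search(self, term: str, directory: str = ".") -> str:
        """Repository-wide text search, truncated to a manageable summary."""
        out = subprocess.run(["grep", "-rn", term, directory],
                             capture_output=True, text=True).stdout
        return "\n".join(out.splitlines()[:20])

    def edit(self, path: str, start: int, end: int, new_text: str) -> None:
        """Replace lines start..end (1-indexed, inclusive) with new_text."""
        lines = Path(path).read_text().splitlines()
        lines[start - 1 : end] = new_text.splitlines()
        Path(path).write_text("\n".join(lines) + "\n")

    def run_tests(self, cmd: str = "pytest -q") -> str:
        """Execute the test suite and return its (truncated) output."""
        out = subprocess.run(cmd.split(), capture_output=True, text=True)
        return (out.stdout + out.stderr)[-2000:]
```

Bounding each command's output (fixed file windows, truncated search and test logs) is what keeps the agent's context manageable across an entire repository.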
- Score-based Generative Modeling in Latent Space [93.8985523558869]
Score-based generative models (SGMs) have recently demonstrated impressive results in terms of both sample quality and distribution coverage.
Here, we propose the Latent Score-based Generative Model (LSGM), a novel approach that trains SGMs in a latent space.
Moving from data to latent space allows us to train more expressive generative models, apply SGMs to non-continuous data, and learn smoother SGMs in a smaller space.
arXiv Detail & Related papers (2021-06-10T17:26:35Z)
- Learning Synthetic Environments for Reinforcement Learning with Evolution Strategies [34.13101380723782]
This work explores learning agent-agnostic synthetic environments (SEs) for Reinforcement Learning.
SEs act as proxies for target environments, allowing agents to be trained more efficiently than by training directly on the target environment.
We show that our method is capable of learning SEs for two discrete-action-space tasks that allow us to train agents more robustly and with up to 60% fewer steps.
arXiv Detail & Related papers (2021-01-24T14:16:13Z)
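The synthetic-environment approach in the last entry is a bi-level optimization: an outer evolution-strategies loop perturbs the SE's parameters, an inner loop trains a fresh agent in each perturbed SE, and each agent's return on the real target environment serves as the fitness signal. A rough sketch of the outer loop, with train_agent_in and evaluate_on_target as assumed stubs rather than the paper's code:

```python
# Rough sketch of learning synthetic-environment (SE) parameters with a
# basic evolution strategy. The two helpers below are illustrative stubs.
import numpy as np

def train_agent_in(se_params: np.ndarray):
    """Inner loop: train a fresh agent inside the SE given by se_params (stub)."""
    raise NotImplementedError

def evaluate_on_target(agent) -> float:
    """Mean episode return of the agent on the real target environment (stub)."""
    raise NotImplementedError

def es_learn_se(theta: np.ndarray, iterations: int = 100, pop_size: int = 16,
                sigma: float = 0.1, lr: float = 0.01) -> np.ndarray:
    """Outer loop: ascend an ES gradient estimate of target-task fitness."""
    for _ in range(iterations):
        eps = np.random.randn(pop_size, theta.size)   # parameter perturbations
        fitness = np.empty(pop_size)
        for i in range(pop_size):
            agent = train_agent_in(theta + sigma * eps[i])  # train in candidate SE
            fitness[i] = evaluate_on_target(agent)          # score on real task
        adv = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        theta = theta + lr / (pop_size * sigma) * eps.T @ adv  # ES update
    return theta
```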
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.