Training Software Engineering Agents and Verifiers with SWE-Gym
- URL: http://arxiv.org/abs/2412.21139v1
- Date: Mon, 30 Dec 2024 18:15:39 GMT
- Title: Training Software Engineering Agents and Verifiers with SWE-Gym
- Authors: Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang
- Abstract summary: SWE-Gym is the first environment for training real-world software engineering (SWE) agents.
SWE-Gym contains 2,438 real-world Python task instances.
- Score: 89.55822534364727
- Abstract: We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language-model-based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.
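The verifier-based inference-time scaling mentioned in the abstract is, in essence, best-of-n sampling: draw several agent trajectories per task, score each with the trained verifier, and keep the top-scoring one. A minimal sketch of that selection step, assuming hypothetical Agent and Verifier interfaces (not SWE-Gym's actual API):

```python
# Minimal sketch of verifier-guided best-of-n inference-time scaling.
# `Agent` and `Verifier` are illustrative stand-ins, not SWE-Gym's API.
from dataclasses import dataclass

@dataclass
class Trajectory:
    actions: list[str]  # the agent's tool calls and edits
    patch: str          # the final code patch produced for the task

class Agent:
    def sample_trajectory(self, task: str) -> Trajectory:
        """One stochastic rollout of the fine-tuned SWE agent (stub)."""
        raise NotImplementedError

class Verifier:
    def score(self, task: str, traj: Trajectory) -> float:
        """Estimated probability that the trajectory resolves the task (stub)."""
        raise NotImplementedError

def best_of_n(agent: Agent, verifier: Verifier, task: str, n: int = 8) -> Trajectory:
    """Sample n candidate trajectories and keep the verifier's favorite."""
    candidates = [agent.sample_trajectory(task) for _ in range(n)]
    return max(candidates, key=lambda t: verifier.score(task, t))
```

Under this scheme, additional inference compute (a larger n) can be traded for a higher resolve rate.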
Related papers
- SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks.
SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues.
We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
- SWE-Bench+: Enhanced Coding Benchmark for LLMs [7.584728644156347]
The SWE-bench dataset comprises 2,294 real-world GitHub issues and their corresponding pull requests.
After filtering out problematic instances, such as those with solution leakage or weak test cases, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%.
The same data quality issues exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified.
arXiv Detail & Related papers (2024-10-09T15:38:53Z)
- AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing.
We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack.
We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous one and a semi-autonomous one that supports human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z) - Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents [106.87436596397816]
Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems.
We propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise.
Experiments show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin.
arXiv Detail & Related papers (2024-08-13T17:50:28Z) - AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation for building generally-capable agents. We take a first step towards building such LLM-based agents with the ability to self-evolve.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that enables LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both, with pass@1 rates of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z)
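The ACI idea summarized above (structured file-viewing, search, edit, and test commands in place of a raw shell) can be pictured with a toy interface. A minimal sketch, with all command names and signatures invented for illustration rather than taken from SWE-agent:

```python
# Toy agent-computer interface (ACI): a small set of structured commands an
# LM agent can issue instead of raw shell access. Names are illustrative only.
import subprocess
from pathlib import Path

class MiniACI:
    def open_file(self, path: str, start: int = 1, window: int = 100) -> str:
        """Show a bounded window of a file so the agent's context stays small."""
        lines = Path(path).read_text().splitlines()
        chunk = lines[start - 1 : start - 1 + window]
        return "\n".join(f"{start + i}: {l}" for i, l in enumerate(chunk))

    def search(self, term: str, directory: str = ".") -> str:
        """Repository-wide text search, truncated to a manageable summary."""
        out = subprocess.run(["grep", "-rn", term, directory],
                             capture_output=True, text=True).stdout
        return "\n".join(out.splitlines()[:20])

    def edit(self, path: str, start: int, end: int, new_text: str) -> None:
        """Replace lines start..end (1-indexed, inclusive) with new_text."""
        lines = Path(path).read_text().splitlines()
        lines[start - 1 : end] = new_text.splitlines()
        Path(path).write_text("\n".join(lines) + "\n")

    def run_tests(self, cmd: str = "pytest -q") -> str:
        """Execute the test suite and return its (truncated) output."""
        out = subprocess.run(cmd.split(), capture_output=True, text=True)
        return (out.stdout + out.stderr)[-2000:]
```

Bounding each command's output (fixed file windows, truncated search and test logs) is what keeps the agent's context manageable across an entire repository.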
- Score-based Generative Modeling in Latent Space [93.8985523558869]
Score-based generative models (SGMs) have recently demonstrated impressive results in terms of both sample quality and distribution coverage.
Here, we propose the Latent Score-based Generative Model (LSGM), a novel approach that trains SGMs in a latent space.
Moving from data to latent space allows us to train more expressive generative models, apply SGMs to non-continuous data, and learn smoother SGMs in a smaller space.
arXiv Detail & Related papers (2021-06-10T17:26:35Z)
- Learning Synthetic Environments for Reinforcement Learning with Evolution Strategies [34.13101380723782]
This work explores learning agent-agnostic synthetic environments (SEs) for Reinforcement Learning.
SEs act as proxies for target environments, allowing agents to be trained more efficiently than by training directly on the target environment.
We show that our method is capable of learning SEs for two discrete-action-space tasks that allow us to train agents more robustly and with up to 60% fewer steps.
arXiv Detail & Related papers (2021-01-24T14:16:13Z)
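The synthetic-environment approach in the last entry is a bi-level optimization: an outer evolution-strategies loop perturbs the SE's parameters, an inner loop trains a fresh agent in each perturbed SE, and each agent's return on the real target environment serves as the fitness signal. A rough sketch of the outer loop, with train_agent_in and evaluate_on_target as assumed stubs rather than the paper's code:

```python
# Rough sketch of learning synthetic-environment (SE) parameters with a
# basic evolution strategy. The two helpers below are illustrative stubs.
import numpy as np

def train_agent_in(se_params: np.ndarray):
    """Inner loop: train a fresh agent inside the SE given by se_params (stub)."""
    raise NotImplementedError

def evaluate_on_target(agent) -> float:
    """Mean episode return of the agent on the real target environment (stub)."""
    raise NotImplementedError

def es_learn_se(theta: np.ndarray, iterations: int = 100, pop_size: int = 16,
                sigma: float = 0.1, lr: float = 0.01) -> np.ndarray:
    """Outer loop: ascend an ES gradient estimate of target-task fitness."""
    for _ in range(iterations):
        eps = np.random.randn(pop_size, theta.size)   # parameter perturbations
        fitness = np.empty(pop_size)
        for i in range(pop_size):
            agent = train_agent_in(theta + sigma * eps[i])  # train in candidate SE
            fitness[i] = evaluate_on_target(agent)          # score on real task
        adv = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        theta = theta + lr / (pop_size * sigma) * eps.T @ adv  # ES update
    return theta
```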
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.