Related papers: SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling

URL: http://arxiv.org/abs/2506.07636v1
Date: Mon, 09 Jun 2025 11:03:16 GMT
Title: SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling
Authors: Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, Yuxiao Dong
Abstract summary: Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use. To address this issue, we present SWE-Dev, an SWE agent built upon open-source LLMs. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models can achieve top performance among all open SWE agents.
Score: 39.53265893083118
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, have offered end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address this issue, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for building SWE-Dev. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models can achieve top performance among all open SWE agents. Specifically, the success rates of the SWE-Dev 7B and 32B parameter models reach 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at https://github.com/THUDM/SWE-Dev.

Related papers

Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs [19.766885088032932]
Software engineering (SWE) has emerged as a crucial testbed for next-generation LLM agents. Most existing datasets are limited to only a few thousand GitHub-sourced instances. We propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
arXiv Detail & Related papers (2025-06-24T03:53:36Z)
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner [53.54568352375669]
We introduce **SWE-Flow**, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, **SWE-Flow** automatically infers incremental development steps directly from unit tests. We generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the **SWE-Flow-Eval** benchmark.
arXiv Detail & Related papers (2025-06-10T17:23:33Z)
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [34.16732444158405]
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. High-quality training data is scarce, especially data that reflects real-world SWE scenarios. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks.
arXiv Detail & Related papers (2025-05-26T18:01:00Z)
SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development [40.48114055515786]
SWE-Dev is the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems. It not only provides high-quality data for Supervised Fine-Tuning (SFT) but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests.
arXiv Detail & Related papers (2025-05-22T17:51:49Z)
SWE-smith: Scaling Data for Software Engineering Agents [100.30273957706237]
SWE-smith is a novel pipeline for generating software engineering training data at scale. We create a dataset of 50k instances sourced from 128 GitHub repositories. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark.
arXiv Detail & Related papers (2025-04-30T16:56:06Z)
SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [10.70881967278009]
We present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.
arXiv Detail & Related papers (2025-04-20T22:37:43Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Training Software Engineering Agents and Verifiers with SWE-Gym [89.55822534364727]
SWE-Gym is the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances. We use SWE-Gym to train language model based SWE agents, achieving up to 19% absolute gains in resolve rate.
arXiv Detail & Related papers (2024-12-30T18:15:39Z)
DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses a large language model (LLM) agent and case-based reasoning (CBR). In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle. In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.