RepoForge: Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale
- URL: http://arxiv.org/abs/2508.01550v2
- Date: Wed, 03 Sep 2025 14:56:47 GMT
- Title: RepoForge: Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale
- Authors: Zhilong Chen, Chengzong Zhao, Boyuan Chen, Dayi Lin, Yihao Chen, Arthur Leung, Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Haoxiang Zhang, Aaditya Bhatia, Chong Chun Yong, Ahmed E. Hassan,
- Abstract summary: Training software engineering (SWE) LLMs is bottlenecked by expensive infrastructure, inefficient evaluation pipelines, scarce training data, and costly quality control. We present RepoForge, an autonomous, end-to-end pipeline that generates, evaluates, and trains SWE agents at scale.
- Score: 15.199441664697988
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Training software engineering (SWE) LLMs is bottlenecked by expensive infrastructure, inefficient evaluation pipelines, scarce training data, and costly quality control. We present RepoForge, an autonomous, end-to-end pipeline that generates, evaluates, and trains SWE agents at scale. Our key contributions include: (1) RepoForge-8B-Agent, achieving 17.4% on SWE-Bench-Verified [swebench_verified2024], establishing a new state of the art for ≤8B non-thinking LLMs; (2) 7,304 executable environments auto-generated from real GitHub commits with zero manual intervention; (3) 14× storage reduction (1.4GB → 102MB per instance) via intelligent dependency management and image pruning; (4) >70% faster evaluation using a Ray-powered [ray2018] distributed RepoForge harness; (5) 19,000× cheaper labeling through our automated SPICE [spice2024] difficulty assessment technique. By unifying storage-efficient sandboxing, a Ray-powered evaluation harness, automated data generation, SPICE-based labeling, and a bubble-free RL scaffold, we demonstrate that even ≤8B models can reach new state-of-the-art performance on demanding benchmarks like SWE-Bench-Verified. Our approach addresses critical bottlenecks in SWE agent training: high storage costs of container-based evaluation, inefficient sequential reward pipelines, limited availability of high-quality training data, expensive manual labeling, and multi-turn RL pipeline bottlenecks.
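The Ray-based harness itself is not shown in the abstract; below is a minimal stdlib sketch of the idea behind the faster evaluation, replacing a sequential reward pipeline with parallel per-instance evaluation. The `evaluate_instance` stub and its pass/fail rule are hypothetical stand-ins, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_instance(instance_id: int) -> dict:
    # Hypothetical stand-in: a real harness would run the agent's patch
    # inside the instance's sandbox and grade it with the repo's test suite.
    passed = instance_id % 3 != 0
    return {"instance": instance_id, "resolved": passed}

def evaluate_batch(instance_ids, max_workers: int = 4):
    # Fan evaluations out across workers instead of scoring instances one
    # at a time, the bottleneck the paper attributes to sequential
    # reward pipelines.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_instance, instance_ids))

if __name__ == "__main__":
    results = evaluate_batch(range(8))
    print(sum(r["resolved"] for r in results), "of", len(results), "resolved")
```

In the paper's setting the workers would be Ray tasks spread over a cluster rather than local threads, but the fan-out/gather structure is the same.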
Related papers
- Pull Requests as a Training Signal for Repo-Level Code Editing [49.82435173554125]
Clean Pull Request (Clean-PR) is a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline that converts noisy pull request diffs into Search/Replace edit blocks through reconstruction and validation. On SWE-bench, our model significantly outperforms the instruction-tuned baseline, achieving absolute improvements of 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified.
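The abstract does not give Clean-PR's exact block syntax; the sketch below applies one Search/Replace edit block to a source string, assuming a diff3-style marker format. Both `apply_edit_block` and the marker strings are assumptions for illustration.

```python
def apply_edit_block(source: str, block: str) -> str:
    """Apply one Search/Replace edit block to source text.

    The marker format here is an assumption modeled on common agent
    edit formats; the paper's exact syntax may differ.
    """
    _, _, rest = block.partition("<<<<<<< SEARCH\n")
    search, _, tail = rest.partition("\n=======\n")
    replace, _, _ = tail.partition("\n>>>>>>> REPLACE")
    if search not in source:
        # Validation step: reject blocks whose search text no longer matches.
        raise ValueError("search text not found in source")
    return source.replace(search, replace, 1)

code = "def add(a, b):\n    return a - b\n"
block = (
    "<<<<<<< SEARCH\n"
    "    return a - b\n"
    "=======\n"
    "    return a + b\n"
    ">>>>>>> REPLACE"
)
print(apply_edit_block(code, block))
```

Rejecting non-matching search text is what makes the representation verifiable: a noisy diff that cannot be reconstructed into a matching block is filtered out of the training data.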
arXiv Detail & Related papers (2026-02-07T09:22:25Z) - SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning [13.174004826305255]
Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller. SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97%.
arXiv Detail & Related papers (2026-01-29T23:08:15Z) - RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure [49.88201789074532]
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. We present RollArt, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure.
arXiv Detail & Related papers (2025-12-27T11:14:23Z) - $\pi_\text{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models [76.66547858171452]
$\pi_\text{RL}$ is an open-source framework for training flow-based Vision-Language-Action (VLA) models in parallel simulation. $\pi_\text{RL}$ boosts few-shot SFT models $\pi_0$ and $\pi_{0.5}$ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train $\pi_\text{RL}$ in 320 parallel environments, improving $\pi_0$ from 41.6% to 85.7% and $\pi_{0.5}$
arXiv Detail & Related papers (2025-10-29T18:37:39Z) - Agentic Reinforcement Learning for Real-World Code Repair [7.512134741776294]
We tackle the challenge of training reliable code-fixing agents in real repositories. We developed a verifiable pipeline with success defined as post-fix build validation. We introduced a scalable simplified pipeline for large-scale reinforcement learning.
arXiv Detail & Related papers (2025-10-24T23:25:02Z) - Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation [6.240627892585199]
We present a production-oriented Big Data as a Service (BDaaS) blueprint that integrates a single-node serverless GPU runtime with TabNet. We conduct benchmarks on the HR, Adult, and BLS datasets, comparing our approach against Spark and CPU baselines. Our results show that GPU pipelines achieve up to 4.5x higher throughput, 98x lower latency, and 90% lower cost per 1K inferences compared to Spark baselines.
arXiv Detail & Related papers (2025-10-22T15:37:42Z) - Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels [96.35283762778137]
We introduce the Webscale-RL pipeline, a scalable data engine for reinforcement learning. We construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
arXiv Detail & Related papers (2025-10-07T22:30:59Z) - Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs [19.766885088032932]
Software engineering (SWE) has emerged as a crucial testbed for next-generation LLM agents. Most existing datasets are limited to only a few thousand GitHub-sourced instances. We propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
arXiv Detail & Related papers (2025-06-24T03:53:36Z) - SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks [34.8513098099929]
SWE-Factory is an automated pipeline designed to create large-scale GitHub issue resolution datasets. SWE-Builder is a multi-agent system that automates evaluation environment construction. Exit-code-based grading achieves 100% accuracy compared to manual inspection.
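Exit-code-based grading can be sketched in a few lines: rather than parsing test logs, the harness runs the repository's test command and grades solely on the exit status. The `grade_by_exit_code` helper and the toy commands below are assumptions for illustration, not SWE-Factory's actual harness.

```python
import subprocess
import sys

def grade_by_exit_code(test_command: list[str]) -> bool:
    # Grade a candidate patch purely by whether the test command exits 0,
    # sidestepping brittle log parsing entirely.
    result = subprocess.run(test_command, capture_output=True)
    return result.returncode == 0

# Toy example: one "test suite" that passes and one that fails.
passing = grade_by_exit_code([sys.executable, "-c", "assert 1 + 1 == 2"])
failing = grade_by_exit_code([sys.executable, "-c", "assert 1 + 1 == 3"])
print(passing, failing)  # True False
```

The appeal of this signal is that it is binary and framework-agnostic: any language's test runner that follows the exit-code convention can be graded the same way.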
arXiv Detail & Related papers (2025-06-12T17:54:17Z) - Training Long-Context LLMs Efficiently via Chunk-wise Optimization [60.05884946552877]
We present Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. We also introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer.
arXiv Detail & Related papers (2025-05-22T14:11:34Z) - Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers [57.95157497749428]
We propose RL$^V$, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier. RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× efficient test-time compute scaling.
arXiv Detail & Related papers (2025-05-07T22:41:26Z) - Towards Efficient Automatic Self-Pruning of Large Language Models [55.90119819642064]
Post-training structured pruning is a promising solution that prunes Large Language Models without the need for retraining. We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer. We introduce Self-Pruner, an end-to-end automatic self-pruning framework for LLMs, which efficiently searches layer-wise pruning rates.
arXiv Detail & Related papers (2025-02-20T09:59:50Z) - Simple ReFlow: Improved Techniques for Fast Flow Models [68.32300636049008]
Diffusion and flow-matching models achieve remarkable generative performance but at the cost of many sampling steps.
We propose seven improvements for training dynamics, learning and inference.
We achieve state-of-the-art FID scores (without / with guidance, resp.) for fast generation via neural ODEs.
arXiv Detail & Related papers (2024-10-10T11:00:55Z) - MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z) - A Specialized Semismooth Newton Method for Kernel-Based Optimal Transport [92.96250725599958]
Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples.
We show that our SSN method achieves a global convergence rate of $O(1/\sqrt{k})$, and a local quadratic convergence rate under standard regularity conditions.
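Spelled out in standard notation, the two rates above read as follows; the error measure $\varepsilon_k$, iterates $x_k$, solution $x^\star$, and constant $C$ are assumptions, since the abstract states only the rates themselves.

```latex
% Global sublinear rate in the iteration count k
% (for some error measure \varepsilon_k, e.g. an optimality gap):
\varepsilon_k = O\!\left(1/\sqrt{k}\right),
% and local quadratic convergence once x_k is near the solution x^\star:
\|x_{k+1} - x^\star\| \le C\,\|x_k - x^\star\|^2 .
```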
arXiv Detail & Related papers (2023-10-21T18:48:45Z) - InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models [3.7414278978078204]
Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems.
The systems challenges faced in this setting are unique; while typical deep learning training jobs are dominated by model execution, the most important factor in DLRM training performance is often online data ingestion.
arXiv Detail & Related papers (2023-08-13T18:28:56Z) - Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query Workload [25.52190205651031]
We develop a tree convolution based data science pipeline that accurately predicts resource consumption patterns of query traces.
We evaluate our pipeline over 19K Presto OLAP queries from Grab, on a data lake of more than 20PB of data.
We demonstrate direct cost savings of up to 13.2x for large batched model training over Microsoft Azure.
arXiv Detail & Related papers (2021-03-23T11:36:10Z) - PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers [47.194426122333205]
PipeTransformer is a distributed training algorithm for Transformer models.
It automatically adjusts the pipelining and data parallelism by identifying and freezing some layers during the training.
We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets.
arXiv Detail & Related papers (2021-02-05T13:39:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.