Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
- URL: http://arxiv.org/abs/2506.19290v1
- Date: Tue, 24 Jun 2025 03:53:36 GMT
- Title: Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
- Authors: Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou
- Abstract summary: Software engineering (SWE) has emerged as a crucial testbed for next-generation LLM agents. Most existing datasets are limited to only a few thousand GitHub-sourced instances. We propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
- Score: 19.766885088032932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To address this, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model's software engineering performance continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
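To make the abstract's validation step concrete, the following is a minimal sketch (not the authors' released code) of how a curated task instance could be checked against its dedicated runtime-environment image and how pass@1 is computed from single attempts. The `TaskInstance` fields, the Docker invocation, and the patch path are illustrative assumptions; the abstract only states that each instance ships with its own runtime image and is validated by unit tests.

```python
# Minimal sketch (not the authors' released code) of runtime-validated task
# checking: apply a candidate patch inside the task's dedicated runtime image,
# run the unit tests, and compute pass@1 from single attempts.
# TaskInstance fields, image names, and paths are illustrative assumptions.
import subprocess
from dataclasses import dataclass


@dataclass
class TaskInstance:
    repo_image: str   # dedicated runtime-environment image for this repo snapshot
    test_cmd: str     # e.g. "pytest -q tests/test_feature.py"
    patch_path: str   # candidate patch produced by the agent


def validate(instance: TaskInstance, timeout_s: int = 900) -> bool:
    """Apply the patch and run the associated unit tests inside the runtime image."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{instance.patch_path}:/tmp/fix.patch:ro",
        instance.repo_image,
        "bash", "-lc",
        f"git apply /tmp/fix.patch && {instance.test_cmd}",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0  # all tests pass -> instance counted as resolved


def pass_at_1(instances: list[TaskInstance]) -> float:
    """pass@1 over single attempts: no verifier, no extra rollouts."""
    return sum(validate(inst) for inst in instances) / len(instances)
```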
Related papers
- SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks [3.3037205426689433]
Evaluations of Large Language Models (LLMs) in software engineering have revealed critical limitations in existing benchmarks. Recent studies have uncovered severe data contamination issues, e.g. SWE-bench reports 32.67% of successful patches involve direct solution leakage. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges.
arXiv Detail & Related papers (2025-07-15T07:52:33Z) - SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner [53.54568352375669]
We introduce **SWE-Flow**, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering datasets that rely on human-submitted issues, **SWE-Flow** automatically infers incremental development steps directly from unit tests. We generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the **SWE-Flow-Eval** benchmark.
arXiv Detail & Related papers (2025-06-10T17:23:33Z) - SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling [39.53265893083118]
Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use. To address this issue, we present SWE-Dev, an SWE agent built upon open-source LLMs. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models can achieve top performance among all open SWE agents.
arXiv Detail & Related papers (2025-06-09T11:03:16Z) - Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering [51.7496756448709]
Language models (LMs) perform well on coding benchmarks but struggle with real-world software engineering tasks. Existing approaches rely on supervised fine-tuning with high-quality data, which is expensive to curate at scale. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process.
arXiv Detail & Related papers (2025-05-29T16:15:36Z) - SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [34.16732444158405]
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, high-quality training data is scarce, especially data that reflects real-world SWE scenarios. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks.
arXiv Detail & Related papers (2025-05-26T18:01:00Z) - SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [10.70881967278009]
We present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in automated program repair (APR) and software engineering automation.
arXiv Detail & Related papers (2025-04-20T22:37:43Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC (a minimal best-of-N sketch of the external strategy appears after this list). We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z) - ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z) - Accelerated Cloud for Artificial Intelligence (ACAI) [24.40451195277244]
We propose an end-to-end cloud-based machine learning platform, Accelerated Cloud for AI (ACAI).
ACAI enables cloud-based storage of indexed, labeled, and searchable data, as well as automatic resource provisioning, job scheduling, and experiment tracking.
We show that our auto-provisioner produces a 1.7x speed-up and 39% cost reduction, and our system reduces experiment time for ML scientists by 20% on typical ML use cases.
arXiv Detail & Related papers (2024-01-30T07:09:48Z)
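The main abstract's jump from 38.0% to 47.0% with test-time scaling, like the external-TTC strategy summarized in "Thinking Longer, Not Larger" above, relies on sampling several candidate solutions and selecting among them. Below is a hypothetical best-of-N sketch of that selection pattern, not either paper's actual implementation; `generate_patch` and `score_patch` are placeholder callables standing in for an agent rollout and a verifier.

```python
# Hypothetical best-of-N test-time scaling: sample several candidate patches
# from the same model and keep the one a verifier scores highest.
# generate_patch and score_patch are placeholders, not components from either paper.
from typing import Callable


def best_of_n(
    issue: str,
    generate_patch: Callable[[str], str],      # one agent rollout -> candidate patch
    score_patch: Callable[[str, str], float],  # verifier: (issue, patch) -> score
    n: int = 8,
) -> str:
    """Draw n candidate patches for an issue and return the highest-scoring one."""
    candidates = [generate_patch(issue) for _ in range(n)]
    return max(candidates, key=lambda patch: score_patch(issue, patch))
```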
This list is automatically generated from the titles and abstracts of the papers indexed on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.