SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
- URL: http://arxiv.org/abs/2602.23866v1
- Date: Fri, 27 Feb 2026 10:06:10 GMT
- Title: SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
- Authors: Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev,
- Abstract summary: SWE-rebench V2 is an automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. We construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests, and rich metadata.
- Score: 39.33317467753191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software engineering (SWE) agents are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity, and often target only a few high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests, and rich metadata, where the problem statement is generated from the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.
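The fail-to-pass criterion underpinning task collection can be illustrated with a small sketch: a candidate instance is usable only if its test suite fails on the unpatched repository and passes once the gold patch is applied. The function and callback names below are hypothetical illustrations, not the paper's actual implementation.

```python
from typing import Callable

def validate_fail_to_pass(run_tests: Callable[[], bool],
                          apply_patch: Callable[[], None],
                          revert_patch: Callable[[], None]) -> bool:
    """Keep an instance only if its tests fail pre-patch and pass post-patch."""
    if run_tests():           # tests already pass: the task does not exercise the bug
        return False
    apply_patch()
    try:
        return run_tests()    # tests must pass once the gold patch is in place
    finally:
        revert_patch()        # leave the repository in its original state

# Toy stand-in for a real repository: the suite passes only when "patched".
state = {"patched": False}
sound = validate_fail_to_pass(
    run_tests=lambda: state["patched"],
    apply_patch=lambda: state.update(patched=True),
    revert_patch=lambda: state.update(patched=False),
)
print(sound)  # a sound fail-to-pass instance -> True
```

In a real pipeline the callbacks would shell out to the repository's synthesized test command and `git apply` inside a pre-built image; the callable interface here just isolates the validation logic.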
Related papers
- Immersion in the GitHub Universe: Scaling Coding Agents to Mastery [60.359983359258955]
ScaleSWE is an automated, sandboxed multi-agent workflow designed to construct high-quality SWE data at scale. The system coordinates three specialized agents for environment setup, test creation, and problem description synthesis to process 6 million pull requests across 5,200 repositories.
arXiv Detail & Related papers (2026-02-10T15:30:19Z)
- SWE-World: Building Software Engineering Agents in Docker-Free Environments [91.17484806743641]
SWE-World is a Docker-free framework that replaces physical execution environments with a learned surrogate for training and evaluating software engineering agents. We show that SWE-World raises Qwen2.5-Coder-32B from 6.2% to 52.0% via Docker-free SFT, to 55.0% with Docker-free RL, and to 68.2% with further TTS.
arXiv Detail & Related papers (2026-02-03T11:44:39Z)
- Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels [96.35283762778137]
We introduce the Webscale-RL pipeline, a scalable data engine for reinforcement learning. We construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
arXiv Detail & Related papers (2025-10-07T22:30:59Z)
- Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes [33.80591142965565]
We present CODE2BENCH, a pipeline for dynamically constructing robust and contamination-resistant benchmarks from real-world GitHub repositories. Specifically, CODE2BENCH introduces three key innovations: (1) Automated Dynamism, achieved through periodic ingestion of recent code to minimize training data contamination; (2) Scope Graph-based dependency analysis, which enables structured classification of functions into benchmark instances with controlled dependency levels; and (3) Property-Based Testing (PBT) for the automated synthesis of rigorous test suites.
arXiv Detail & Related papers (2025-08-10T05:06:36Z)
- FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language [48.79534869177174]
We introduce a new pre-training dataset curation pipeline based on FineWeb. We show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20-terabyte (5-billion-document) multilingual dataset.
arXiv Detail & Related papers (2025-06-26T01:01:47Z)
- Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs [19.766885088032932]
Software engineering (SWE) has emerged as a crucial testbed for next-generation LLM agents. Most existing datasets are limited to only a few thousand GitHub-sourced instances. We propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
arXiv Detail & Related papers (2025-06-24T03:53:36Z)
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [31.921127664873882]
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. High-quality training data is scarce, especially data that reflects real-world SWE scenarios. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks.
arXiv Detail & Related papers (2025-05-26T18:01:00Z)
- XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation [80.18830380517753]
We develop a new task-agnostic distillation framework XtremeDistilTransformers.
We study the transferability of several source tasks, augmentation resources and model architecture for distillation.
arXiv Detail & Related papers (2021-06-08T17:49:33Z)
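Property-Based Testing, mentioned in the CODE2BENCH entry above, checks a universal property over many generated inputs instead of a handful of hand-picked cases. Below is a minimal, library-free sketch of the idea (real pipelines would typically use a framework such as Hypothesis); the run-length codec is just a toy function under test, not anything from the papers listed here.

```python
import random

def rle_encode(s: str) -> list[tuple[str, int]]:
    """Toy function under test: run-length encoding."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)

def round_trip(s: str) -> bool:
    # The property: decoding an encoding recovers the original string.
    return rle_decode(rle_encode(s)) == s

# Generate many random inputs and check the property on each one.
rng = random.Random(0)
samples = ["".join(rng.choice("abc") for _ in range(rng.randrange(12)))
           for _ in range(500)]
print(all(round_trip(s) for s in samples))  # True
```

The appeal for benchmark construction is that properties like this round-trip invariant can be synthesized once per function and then exercise far more of the input space than example-based tests.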
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.