Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
- URL: http://arxiv.org/abs/2602.16819v1
- Date: Wed, 18 Feb 2026 19:30:55 GMT
- Title: Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
- Authors: Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi, Sathwik Acharya, Xingyao Wang, Carolyn Rose, Graham Neubig, Daniel Fried
- Abstract summary: In this paper, we describe some transferable skills that are shared across diverse tasks. We propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks. Experiments show that agents trained on our synthetic tasks effectively generalize to diverse real-world tasks.
- Score: 59.95803522351185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When assessing the quality of coding agents, predominant benchmarks, such as SWE-Bench, focus on solving single issues on GitHub. In contrast, in real use, these agents solve more varied and complex tasks that involve other skills such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize some transferable skills that are shared across diverse tasks by decomposing trajectories into fine-grained components, and derive a set of principles for designing auxiliary training tasks to teach language models these skills. Guided by these principles, we propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks effectively generalize to diverse real-world tasks that are not present in training, improving a base model by an absolute 25.4% on SWE-Bench Verified, 7.9% on SWT-Bench Verified, and 5.1% on Commit-0 Lite. Hybrid-Gym also complements datasets built for the downstream tasks (e.g., improving SWE-Play by 4.9% on SWT-Bench Verified). Code available at: https://github.com/yiqingxyq/Hybrid-Gym.
Related papers
- AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents [49.67355440164857]
We introduce AIRS-Bench, a suite of 20 tasks sourced from state-of-the-art machine learning papers. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
arXiv Detail & Related papers (2026-02-06T16:45:02Z)
- WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks [35.99528846296261]
WebGym is the largest-to-date open-source environment for training realistic visual web agents. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites.
arXiv Detail & Related papers (2026-01-05T09:35:11Z)
- Training Versatile Coding Agents in Synthetic Environments [44.5849223659282]
We introduce SWE-Playground, a novel pipeline for generating environments and trajectories. SWE-Playground synthetically generates projects and tasks from scratch with strong language models and agents. This allows us to tackle a much wider variety of coding tasks, such as reproducing issues by generating unit tests and implementing libraries from scratch.
arXiv Detail & Related papers (2025-12-13T07:02:28Z)
- GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging [41.754784344572286]
We release GitTaskBench, a benchmark for evaluating code agents in real-world scenarios. Each task pairs a relevant repository with an automated, human-curated evaluation harness. We also propose the alpha-value metric to quantify the economic benefit of agent performance.
arXiv Detail & Related papers (2025-08-26T12:48:05Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- Multi-Task Retrieval-Augmented Text Generation with Relevance Sampling [19.17759446168802]
We study multi-task training of retrieval-augmented generation models for knowledge-intensive tasks.
We filter training examples via a threshold of confidence on the relevance labels, whether a pair is answerable by the knowledge base or not.
arXiv Detail & Related papers (2022-07-07T00:57:02Z)
- KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Few-Shot NLP [68.43279384561352]
Existing data augmentation algorithms leverage task-independent rules or fine-tune general-purpose pre-trained language models.
These methods have trivial task-specific knowledge and are limited to yielding low-quality synthetic data for weak baselines in simple tasks.
We propose the Knowledge Mixture Data Augmentation Model (KnowDA): an encoder-decoder LM pretrained on a mixture of diverse NLP tasks.
arXiv Detail & Related papers (2022-06-21T11:34:02Z)
- Combining Modular Skills in Multitask Learning [149.8001096811708]
A modular design encourages neural models to disentangle and recombine different facets of knowledge to generalise more systematically to new tasks.
In this work, we assume each task is associated with a subset of latent discrete skills from a (potentially small) inventory.
We find that the modular design of a network significantly increases sample efficiency in reinforcement learning and few-shot generalisation in supervised learning.
arXiv Detail & Related papers (2022-02-28T16:07:19Z)
- Environment Generation for Zero-Shot Compositional Reinforcement Learning [105.35258025210862]
Compositional Design of Environments (CoDE) trains a Generator agent to automatically build a series of compositional tasks tailored to the agent's current skill level.
We learn to generate environments composed of multiple pages or rooms, and train RL agents capable of completing a wide range of complex tasks in those environments.
CoDE yields a 4x higher success rate than the strongest baseline, and demonstrates strong performance on real websites learned on 3500 primitive tasks.
arXiv Detail & Related papers (2022-01-21T21:35:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.