Endless Terminals: Scaling RL Environments for Terminal Agents
- URL: http://arxiv.org/abs/2601.16443v2
- Date: Tue, 27 Jan 2026 03:34:47 GMT
- Title: Endless Terminals: Scaling RL Environments for Terminal Agents
- Authors: Kanishk Gandhi, Shivam Garg, Noah D. Goodman, Dimitris Papailiopoulos
- Abstract summary: Endless Terminals is a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop. These improvements transfer to human-curated benchmarks.
- Score: 39.60665149203152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains on our held-out dev set: Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches, including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
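The training setup the abstract describes (binary episode-level rewards plus a minimal interaction loop with no retrieval, multi-agent coordination, or specialized tools) can be sketched as follows. This is a toy illustration, not the paper's implementation: the environment, policy, and turn budget are hypothetical stand-ins.

```python
class ToyTerminalEnv:
    """Hypothetical stand-in for one containerized task environment.
    The real pipeline builds and validates these automatically and
    attaches a generated completion test; here the test is hard-coded."""

    def __init__(self, target_file="done.txt"):
        self.target_file = target_file
        self.files = set()

    def reset(self):
        self.files.clear()
        return "task: create done.txt in the working directory"

    def step(self, command):
        # Extremely simplified "terminal": only `touch` is understood.
        if command.startswith("touch "):
            self.files.add(command.split(" ", 1)[1])
        return "ok"

    def completion_test(self):
        return self.target_file in self.files


def run_episode(env, policy, max_turns=8):
    """Minimal interaction loop: the policy reads the latest observation,
    emits a shell command (or None to stop), and the episode ends with a
    single binary reward from the completion test."""
    observation = env.reset()
    for _ in range(max_turns):
        command = policy(observation)
        if command is None:
            break
        observation = env.step(command)
    return 1.0 if env.completion_test() else 0.0


def scripted_policy():
    """Trivial policy for illustration: issue one command, then stop."""
    state = {"done": False}

    def policy(observation):
        if state["done"]:
            return None
        state["done"] = True
        return "touch done.txt"

    return policy


reward = run_episode(ToyTerminalEnv(), scripted_policy())
```

In the paper this episode-level reward would feed vanilla PPO updates; here `reward` is 1.0 because the scripted policy solves the toy task.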
Related papers
- On Data Engineering for Scaling LLM Terminal Capabilities [62.14352406328365]
Training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks.
arXiv Detail & Related papers (2026-02-24T18:51:04Z) - EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments [0.10934862523101825]
We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments.
arXiv Detail & Related papers (2026-02-18T04:35:46Z) - Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters [169.7981969517903]
Step 3.5 Flash bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution.
arXiv Detail & Related papers (2026-02-11T07:53:51Z) - CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability [50.57373283154859]
We present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming vulnerability tasks. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success rate. We synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security.
arXiv Detail & Related papers (2026-02-03T02:27:16Z) - Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments [36.81059045059001]
Training agentic models for terminal-based tasks depends on high-quality terminal trajectories that capture realistic long-horizon interactions. We propose TerminalTraj, a scalable pipeline that generates Docker-aligned task instances and synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains.
arXiv Detail & Related papers (2026-02-01T14:09:23Z) - MAI-UI Technical Report: Real-World Centric Foundation GUI Agents [33.46555542782679]
MAI-UI is a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, and the absence of a practical deployment architecture.
arXiv Detail & Related papers (2025-12-26T14:51:52Z) - Adaptive Data Flywheel: Applying MAPE Control Loops to AI Agent Improvement [8.230420096371407]
We present a practical implementation of a data flywheel in NVInfo AI, NVIDIA's Mixture-of-Experts (MoE) Knowledge Assistant serving over 30,000 employees. We built a closed-loop system that addresses failures in retrieval-augmented generation (RAG) pipelines and enables continuous learning. For routing, we replaced a Llama 3.1 model with a fine-tuned 8B variant, achieving 96% accuracy, a 10x reduction in model size, and a 70% latency improvement.
arXiv Detail & Related papers (2025-10-30T23:41:06Z) - Agentic Reinforcement Learning for Real-World Code Repair [7.512134741776294]
We tackle the challenge of training reliable code-fixing agents in real repositories. We developed a verifiable pipeline with success defined as post-fix build validation. We introduced a scalable simplified pipeline for large-scale reinforcement learning.
arXiv Detail & Related papers (2025-10-24T23:25:02Z) - UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action [77.63125913907771]
We present UltraCUA, a foundation model that bridges the gap between GUI primitives and high-level programmatic tool calls. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents.
arXiv Detail & Related papers (2025-10-20T17:48:26Z) - Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation [65.3648667980258]
Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL). We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model and 7.34% higher than the open-source SOTA.
arXiv Detail & Related papers (2025-09-28T13:19:20Z) - UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z) - SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks [34.8513098099929]
SWE-Factory is an automated pipeline designed to create large-scale GitHub issue resolution datasets. SWE-Builder is a multi-agent system that automates evaluation environment construction. Exit-code-based grading achieves 100% accuracy compared to manual inspection.
arXiv Detail & Related papers (2025-06-12T17:54:17Z) - Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
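Exit-code-based grading of the kind the SWE-Factory summary mentions reduces to running a test command and reading its return status. A minimal sketch, assuming a POSIX shell; the command strings are illustrative, not taken from the paper:

```python
import subprocess

def exit_code_grade(test_command: str) -> bool:
    """Run a test command in a shell and grade by exit code:
    0 is treated as pass, anything else as fail."""
    result = subprocess.run(test_command, shell=True)
    return result.returncode == 0

# `true` and `false` are standard POSIX utilities exiting 0 and 1.
passed = exit_code_grade("true")
failed = exit_code_grade("false")
```

The appeal of this scheme is that it needs no output parsing: any test harness that sets a conventional exit status can be graded the same way.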
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z)
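Step-DPO's core idea, scoring a preference at the level of one reasoning step rather than a whole answer, can be illustrated with the standard DPO objective applied to a single step. A minimal sketch: the log-probabilities and the temperature `beta` below are placeholder values, not figures from the paper.

```python
import math

def step_dpo_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for a single reasoning step: push the policy's
    log-ratio (vs. a frozen reference model) for the chosen step
    above that of the rejected step."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With equal log-ratios the margin is 0, so the loss is log(2).
neutral = step_dpo_loss(-1.0, -1.0, -1.0, -1.0)
```

Summing this loss over the individual steps of a chain-of-thought, instead of evaluating the final answer holistically, is the step-wise granularity the entry describes.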
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.