Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments
- URL: http://arxiv.org/abs/2602.01244v2
- Date: Tue, 03 Feb 2026 14:03:32 GMT
- Title: Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments
- Authors: Siwei Wu, Yizhi Li, Yuyang Song, Wei Zhang, Yang Wang, Riza Batista-Navarro, Xian Yang, Mingjie Tang, Bryan Dai, Jian Yang, Chenghua Lin
- Abstract summary: Training agentic models for terminal-based tasks depends on high-quality terminal trajectories that capture realistic long-horizon interactions. We propose TerminalTraj, a scalable pipeline that generates Docker-aligned task instances and synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains.
- Score: 36.81059045059001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: Executability, since each instance requires a suitable and often distinct Docker environment; and Verifiability, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose TerminalTraj, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20% on TB 1.0 and 10% on TB 2.0 over their respective backbones. Notably, TerminalTraj-32B achieves strong performance among models with fewer than 100B parameters, reaching 35.30% on TB 1.0 and 22.00% on TB 2.0, and demonstrates improved test-time scaling behavior. All code and data are available at https://github.com/Wusiwei0410/TerminalTraj.
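The three-stage pipeline described in the abstract (filter repositories, generate Docker-aligned tasks, synthesize and verify trajectories) can be sketched in miniature as follows. This is an illustrative sketch only: the function names, the star/test filtering heuristic, and the exit-code validator are assumptions for demonstration, not the authors' actual implementation, and the agent rollout is stubbed rather than run in a real Docker container.

```python
from dataclasses import dataclass

@dataclass
class Repo:
    name: str
    stars: int
    has_tests: bool

@dataclass
class TaskInstance:
    repo: str
    instruction: str
    validator: str  # executable check; here a Python expression over exit_code

def filter_repos(repos, min_stars=50):
    """Stage (i): keep repositories likely to yield a working Dockerized env."""
    return [r for r in repos if r.stars >= min_stars and r.has_tests]

def generate_tasks(repos):
    """Stage (ii): derive one Docker-aligned task instance per repository."""
    return [TaskInstance(repo=r.name,
                         instruction=f"run the test suite of {r.name}",
                         validator="exit_code == 0")
            for r in repos]

def synthesize_trajectories(tasks, run_agent):
    """Stage (iii): roll out an agent and keep only verified trajectories."""
    verified = []
    for task in tasks:
        exit_code = run_agent(task)  # stand-in for a terminal interaction loop
        if eval(task.validator, {"exit_code": exit_code}):
            verified.append((task, exit_code))
    return verified

repos = [Repo("toolbox", 120, True),
         Repo("scratchpad", 3, False),
         Repo("webapp", 800, True)]
tasks = generate_tasks(filter_repos(repos))
# A trivial stub agent that always succeeds; a real pipeline would execute
# commands inside the per-instance Docker image and capture the exit code.
trajectories = synthesize_trajectories(tasks, run_agent=lambda t: 0)
print(len(trajectories))  # prints 2: "scratchpad" is filtered out in stage (i)
```

The key structural point the sketch captures is that verification is attached per instance (each task carries its own executable validator), which is how heterogeneous task outputs can be checked without a single standardized verifier.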
Related papers
- On Data Engineering for Scaling LLM Terminal Capabilities [62.14352406328365]
Training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks.
arXiv Detail & Related papers (2026-02-24T18:51:04Z) - CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion [26.52253286270211]
Agentic coding requires agents to interact with runtime environments, e.g., command-line interfaces (CLIs). We propose to employ agents to simulate and explore environment histories, guided by execution feedback. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, the largest collection of its kind.
arXiv Detail & Related papers (2026-02-11T16:22:18Z) - TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents [70.68963723787424]
TermiGen is an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. Our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench.
arXiv Detail & Related papers (2026-02-06T23:56:50Z) - ANCHOR: Branch-Point Data Generation for GUI Agents [52.22377425487]
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data. We present Anchor, a trajectory-expansion framework that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
arXiv Detail & Related papers (2026-02-06T19:55:26Z) - Endless Terminals: Scaling RL Environments for Terminal Agents [39.60665149203152]
Endless Terminals is a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop. These improvements transfer to human-curated benchmarks.
arXiv Detail & Related papers (2026-01-23T04:39:55Z) - GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents [59.107657859025586]
GUI-360$^\circ$ is a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. The dataset supports three canonical tasks (GUI grounding, screen parsing, and action prediction) and a hybrid GUI+API action space.
arXiv Detail & Related papers (2025-11-06T12:19:02Z) - From Editor to Dense Geometry Estimator [77.21804448599009]
We introduce FE2E, a framework that adapts an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. FE2E achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ the data.
arXiv Detail & Related papers (2025-09-04T15:58:50Z) - Distributed Training under Packet Loss [8.613477072763404]
Leveraging unreliable connections reduces latency but may sacrifice model accuracy and convergence once packets are dropped. We introduce a principled, end-to-end solution that preserves accuracy and convergence guarantees under genuine packet loss. This work bridges the gap between communication-efficient protocols and the accuracy guarantees demanded by modern large-model training.
arXiv Detail & Related papers (2025-07-02T11:07:20Z) - Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs [19.766885088032932]
Software engineering (SWE) has emerged as a crucial testbed for next-generation LLM agents. Most existing datasets are limited to only a few thousand GitHub-sourced instances. We propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
arXiv Detail & Related papers (2025-06-24T03:53:36Z) - APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets [99.8988504388011]
APIGen is an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications.
We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets.
We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains.
arXiv Detail & Related papers (2024-06-26T17:49:11Z) - Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.