TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents
- URL: http://arxiv.org/abs/2602.07274v1
- Date: Fri, 06 Feb 2026 23:56:50 GMT
- Title: TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents
- Authors: Kaijie Zhu, Yuzhou Nie, Yijiang Li, Yiming Huang, Jialian Wu, Jiang Liu, Ximeng Sun, Zhenfei Yin, Lun Wang, Zicheng Liu, Emad Barsoum, William Yang Wang, Wenbo Guo
- Abstract summary: TermiGen is an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. Our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench.
- Score: 70.68963723787424
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories lack diversity and scalability, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit the simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill-equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers via an iterative multi-agent refinement loop. Subsequently, we employ a Generator-Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error-correction cycles. Fine-tuned on this TermiGen-generated dataset, our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open-weights state of the art, outperforming existing baselines and notably surpassing capable proprietary models such as o4-mini. The dataset is available at https://github.com/ucsb-mlsec/terminal-bench-env.
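The Generator-Critic protocol described in the abstract can be sketched as a toy collection loop: a generator proposes the next command, a critic occasionally corrupts it to force a runtime failure, and the subsequent retry with the correct command is recorded, so the resulting trajectory contains explicit error-correction cycles. All names, the corruption strategy, and the toy environment below are illustrative assumptions, not the paper's implementation:

```python
import random

def generate_step(task_state):
    # Toy "generator": propose the correct next command for the task.
    return task_state["expected_cmd"]

def inject_error(cmd):
    # Toy "critic": corrupt the command (here, append a bogus flag)
    # so that execution fails and a recovery step is needed.
    return cmd + " --no-such-flag"

def execute(cmd, task_state):
    # Toy environment: succeeds only on the exact expected command.
    return cmd == task_state["expected_cmd"]

def collect_trajectory(task_state, error_rate=0.5, seed=0):
    """Collect one trajectory; with probability `error_rate` the critic
    injects a faulty action, and the recovery (retrying the correct
    command) is recorded as an error-correction cycle."""
    rng = random.Random(seed)
    trajectory = []
    cmd = generate_step(task_state)
    if rng.random() < error_rate:
        bad = inject_error(cmd)
        trajectory.append({"action": bad, "ok": execute(bad, task_state)})
    # Recovery (or normal) step: execute the correct command.
    trajectory.append({"action": cmd, "ok": execute(cmd, task_state)})
    return trajectory

task = {"expected_cmd": "tar -xzf data.tar.gz"}
traj = collect_trajectory(task, error_rate=1.0)  # force an injected error
```

With `error_rate=1.0` the trajectory always contains a failed step followed by the successful correction, which is the kind of recovery supervision the abstract argues is missing from standard expert-only trajectories.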
Related papers
- ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution [13.109726609738749]
ParEVO is a framework designed to synthesize high-performance parallel algorithms for irregular data. On the ParEval benchmark, ParEVO achieves an average 106x speedup, and a robust 13.6x speedup on complex irregular graph problems.
arXiv Detail & Related papers (2026-03-03T01:41:07Z) - On Data Engineering for Scaling LLM Terminal Capabilities [62.14352406328365]
Training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks.
arXiv Detail & Related papers (2026-02-24T18:51:04Z) - Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. We propose Agent World Model (AWM), a fully synthetic environment generation pipeline. We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z) - Beyond Quantity: Trajectory Diversity Scaling for Code Agents [51.71414642763219]
Trajectory Diversity Scaling is a data synthesis framework for code agents that scales performance through diversity rather than raw volume. TDScaling integrates innovations including: (1) a Business Cluster mechanism that captures real-service logical dependencies; (2) a blueprint-driven multi-agent paradigm that enforces trajectory coherence; and (3) an adaptive evolution mechanism that steers toward long-tail scenarios.
arXiv Detail & Related papers (2026-02-03T07:43:03Z) - MERGETUNE: Continued fine-tuning of vision-language models [77.8627788911249]
Fine-tuning vision-language models (VLMs) often leads to catastrophic forgetting of pretrained knowledge. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted.
arXiv Detail & Related papers (2026-01-15T15:15:53Z) - From Failure to Mastery: Generating Hard Samples for Tool-use Agents [40.331752086107265]
HardGen is an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. The advanced tools and hard queries enable the generation of verifiable, complex Chain-of-Thought (CoT) data. Our code, models, and dataset will be open-sourced to facilitate future research.
arXiv Detail & Related papers (2026-01-04T11:56:33Z) - GRASP: Guided Residual Adapters with Sample-wise Partitioning [10.504309161945065]
We propose GRASP: Guided Residual Adapters with Sample-wise Partitioning. On the long-tail MIMIC-CXR-LT dataset, GRASP yields superior FID and diversity metrics, especially for rare classes.
arXiv Detail & Related papers (2025-12-01T13:43:17Z) - Agentic Reinforcement Learning for Real-World Code Repair [7.512134741776294]
We tackle the challenge of training reliable code-fixing agents in real repositories. We developed a verifiable pipeline with success defined as post-fix build validation. We introduced a scalable, simplified pipeline for large-scale reinforcement learning.
arXiv Detail & Related papers (2025-10-24T23:25:02Z) - Reinforcement Learning for Machine Learning Engineering Agents [52.03168614623642]
We show that agents backed by weaker models that improve via reinforcement learning can outperform agents backed by much larger, but static, models. We propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. We also propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early.
arXiv Detail & Related papers (2025-09-01T18:04:10Z) - SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [10.70881967278009]
We present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.
arXiv Detail & Related papers (2025-04-20T22:37:43Z) - STAMP: Scalable Task And Model-agnostic Collaborative Perception [24.890993164334766]
STAMP is a task- and model-agnostic collaborative perception pipeline for heterogeneous agents. It minimizes computational overhead, enhances scalability, and preserves model security. As a first-of-its-kind framework, STAMP aims to advance research in scalable and secure mobility systems towards Level 5 autonomy.
arXiv Detail & Related papers (2025-01-24T16:27:28Z) - OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents. Instead of relying on pre-defined tasks, OS-Genesis enables agents to first perceive environments and perform step-wise interactions. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.