Related papers: CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion

URL: http://arxiv.org/abs/2602.10999v1
Date: Wed, 11 Feb 2026 16:22:18 GMT
Title: CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion
Authors: Yusong Lin, Haiyang Wang, Shuzhe Wu, Lue Fan, Feiyang Pan, Sanyuan Zhao, Dandan Tu,
Abstract summary: Agentic coding requires agents to interact with runtime environments, e.g., command line interfaces (CLI). We propose to employ agents to simulate and explore environment histories, guided by execution feedback. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, being the largest collection of its kind.
Score: 26.52253286270211
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Agentic coding requires agents to effectively interact with runtime environments, e.g., command line interfaces (CLI), so as to complete tasks like resolving dependency issues, fixing system problems, etc. But it remains underexplored how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing histories of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packing the buggy state and the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, being the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves substantial absolute improvements of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.

Related papers

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. We propose Agent World Model (AWM), a fully synthetic environment generation pipeline. We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z)
ANCHOR: Branch-Point Data Generation for GUI Agents [52.22377425487]
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data. We present a trajectory expansion framework Anchor that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
arXiv Detail & Related papers (2026-02-06T19:55:26Z)
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments inspired by problems from the real world. We show that frontier models and agents score less than 65% on the benchmark. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z)
Beyond Rule-Based Workflows: An Information-Flow-Orchestrated Multi-Agents Paradigm via Agent-to-Agent Communication from CORAL [0.15199492741752027]
We propose an Information-Flow-Orchestrated Multi-Agent Paradigm via Agent-to-Agent (A2A) Communication. We evaluate our approach on the general-purpose benchmark GAIA, using the representative workflow-based MAS as the baseline. Our method achieves 63.64% accuracy, outperforming OWL's 55.15% by 8.49 percentage points with comparable token consumption.
arXiv Detail & Related papers (2026-01-14T21:35:51Z)
What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding [50.35012849818872]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks. We propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. Our experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment.
arXiv Detail & Related papers (2026-01-14T14:09:11Z)
CaveAgent: Transforming LLMs into Stateful Runtime Operators [31.548422546991915]
We present CaveAgent, a framework that transforms the paradigm from "LLM-as-Text-Generator" to "LLM-as-Runtime-Operator". CaveAgent achieves a 10.5% success rate improvement on retail tasks and reduces total token consumption by 28.4% in multi-turn scenarios.
arXiv Detail & Related papers (2026-01-04T15:32:47Z)
CuES: A Curiosity-driven and Environment-grounded Synthesis Framework for Agentic RL [35.086788669916594]
Large language model-based agents are increasingly deployed in complex, tool-augmented environments. Existing approaches typically assume predefined task collections, an assumption that fails in novel environments. We propose CuES, a Curiosity-driven and Environment-grounded Synthesis framework that autonomously generates diverse, executable, and meaningful tasks.
arXiv Detail & Related papers (2025-12-01T06:11:37Z)
Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents [71.85020581835042]
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup-planning.
arXiv Detail & Related papers (2025-10-29T16:59:07Z)
Generalizable End-to-End Tool-Use RL with Synthetic CodeGym [52.31172214690965]
We introduce CodeGym, a framework that synthesizes diverse, verifiable, and controllable multi-turn tool-use environments for agent RL. CodeGym rewrites static coding problems into interactive environments by extracting atomic functions or logic into callable tools. Models of varying sizes and chain-of-thought configurations, trained in CodeGym, exhibit consistent out-of-distribution generalizability.
arXiv Detail & Related papers (2025-09-22T03:03:56Z)
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [31.921127664873882]
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. High-quality training data is scarce, especially data that reflects real-world SWE scenarios. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks.
arXiv Detail & Related papers (2025-05-26T18:01:00Z)
Repo2Run: Automated Building Executable Environment for Code Repository at Scale [10.143091612327602]
We introduce Repo2Run, an agent that aims to automate the building of executable test environments for any repository at scale. Repo2Run iteratively builds the Docker image, runs unit tests based on the build feedback, and synthesizes the Dockerfile. The resulting Dockerfile can then be used to create Docker container environments for running code and tests.
arXiv Detail & Related papers (2025-02-19T12:51:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.