On Data Engineering for Scaling LLM Terminal Capabilities
- URL: http://arxiv.org/abs/2602.21193v1
- Date: Tue, 24 Feb 2026 18:51:04 GMT
- Title: On Data Engineering for Scaling LLM Terminal Capabilities
- Authors: Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping
- Abstract summary: Training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks.
- Score: 62.14352406328365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long-context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
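The abstract names seed-based and skill-based task construction but gives no implementation details. A minimal illustrative sketch of the seed-based idea, under assumptions of our own (the `TerminalTask` schema, the seed template, and the parameter pool are all hypothetical, not the paper's actual pipeline), could look like:

```python
import random
from dataclasses import dataclass

@dataclass
class TerminalTask:
    instruction: str       # natural-language goal shown to the agent
    setup_cmds: list[str]  # shell commands that prepare the environment
    check_cmd: str         # command whose exit status verifies success

# Hypothetical seed task; a real pipeline would mine seeds from
# existing benchmarks, docs, or shell-usage corpora.
SEEDS = [
    TerminalTask(
        instruction="Count the lines in {path}",
        setup_cmds=["printf 'a\\nb\\nc\\n' > {path}"],
        check_cmd='test "$(wc -l < {path})" -eq 3',
    ),
]

def expand_seed(seed: TerminalTask, rng: random.Random) -> TerminalTask:
    """Seed-based construction: instantiate a seed template with
    concrete parameters to yield a new, automatically verifiable task."""
    path = rng.choice(["/tmp/data.txt", "/tmp/log.txt", "/tmp/notes.txt"])
    return TerminalTask(
        instruction=seed.instruction.format(path=path),
        setup_cmds=[c.format(path=path) for c in seed.setup_cmds],
        check_cmd=seed.check_cmd.format(path=path),
    )

rng = random.Random(0)
tasks = [expand_seed(SEEDS[0], rng) for _ in range(3)]
```

The key property this sketch preserves is that every generated task carries its own setup and verification commands, so success can be checked by exit status without human labels.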
Related papers
- TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents [70.68963723787424]
TermiGen is an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. Our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench.
arXiv Detail & Related papers (2026-02-06T23:56:50Z)
- AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents [75.67445299298949]
AgentCPM-Explore is a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward-signal denoising, and contextual information refinement. AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet and DeepSeek-v3.2 on five benchmarks.
arXiv Detail & Related papers (2026-02-06T08:24:59Z)
- Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments [36.81059045059001]
Training agentic models for terminal-based tasks depends on high-quality terminal trajectories that capture realistic long-horizon interactions. We propose TerminalTraj, a scalable pipeline that generates Docker-aligned task instances and synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains.
arXiv Detail & Related papers (2026-02-01T14:09:23Z)
- Endless Terminals: Scaling RL Environments for Terminal Agents [39.60665149203152]
Endless Terminals is a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop. These improvements transfer to human-curated benchmarks.
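The summary mentions vanilla PPO with binary episode-level rewards, but gives no formulation. A minimal sketch of what that reward scheme and the standard PPO clipped objective look like (helper names, `gamma`, and the clipping constant `eps` are assumptions, not taken from the paper):

```python
import math

def episode_returns(num_steps: int, success: bool,
                    gamma: float = 1.0) -> list[float]:
    """Binary episode-level reward: 1.0 if the task's verification
    passed, else 0.0, credited back over all steps. With gamma == 1.0
    (undiscounted), every step receives the same return."""
    reward = 1.0 if success else 0.0
    return [reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]

def ppo_clip_loss(logp_new: float, logp_old: float,
                  advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate loss for a single action:
    -min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the
    probability ratio between the new and old policies."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio * advantage, clipped * advantage)
```

With a binary terminal signal, every action in a successful episode is reinforced equally; the clipping keeps any single update from moving the policy too far from the one that collected the data.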
arXiv Detail & Related papers (2026-01-23T04:39:55Z)
- Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation [65.3648667980258]
Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL). We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA.
arXiv Detail & Related papers (2025-09-28T13:19:20Z)
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z)
- Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs [19.766885088032932]
Software engineering (SWE) has emerged as a crucial testbed for next-generation LLM agents. Most existing datasets are limited to only a few thousand GitHub-sourced instances. We propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
arXiv Detail & Related papers (2025-06-24T03:53:36Z)
- A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices [18.853357902416832]
Current on-device model training is hampered by low training throughput, limited storage, and diverse data importance. We propose a two-stage data selection framework, Titan, to select the most important data batch from streaming data for model training. Titan achieves up to 43% reduction in training time and a 6.2% increase in final accuracy with minor system overhead.
arXiv Detail & Related papers (2025-05-22T11:53:48Z)
- NeMo-Inspector: A Visualization Tool for LLM Generation Analysis [6.55530159050218]
We introduce NeMo-Inspector, an open-source tool designed to simplify the analysis of synthetic datasets with integrated inference capabilities. We demonstrate its effectiveness through two real-world cases.
arXiv Detail & Related papers (2025-05-01T22:47:06Z)
- Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [68.94373533768501]
We model knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
- Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for vision-and-language navigation learning. We use 1,200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs. Thanks to this large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute over the previous SoTA) to a new best of 80% single-run success rate on the R2R test split by simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.