On Data Engineering for Scaling LLM Terminal Capabilities
- URL: http://arxiv.org/abs/2602.21193v1
- Date: Tue, 24 Feb 2026 18:51:04 GMT
- Title: On Data Engineering for Scaling LLM Terminal Capabilities
- Authors: Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping
- Abstract summary: Training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks.
- Score: 62.14352406328365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long-context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
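The abstract names seed-based and skill-based task construction but gives no implementation details. A minimal illustrative sketch of the seed-based idea, under assumptions of our own (the `TerminalTask` schema, the seed template, and the parameter pool are all hypothetical, not the paper's actual pipeline), could look like:

```python
import random
from dataclasses import dataclass

@dataclass
class TerminalTask:
    instruction: str       # natural-language goal shown to the agent
    setup_cmds: list[str]  # shell commands that prepare the environment
    check_cmd: str         # command whose exit status verifies success

# Hypothetical seed task; a real pipeline would mine seeds from
# existing benchmarks, docs, or shell-usage corpora.
SEEDS = [
    TerminalTask(
        instruction="Count the lines in {path}",
        setup_cmds=["printf 'a\\nb\\nc\\n' > {path}"],
        check_cmd='test "$(wc -l < {path})" -eq 3',
    ),
]

def expand_seed(seed: TerminalTask, rng: random.Random) -> TerminalTask:
    """Seed-based construction: instantiate a seed template with
    concrete parameters to yield a new, automatically verifiable task."""
    path = rng.choice(["/tmp/data.txt", "/tmp/log.txt", "/tmp/notes.txt"])
    return TerminalTask(
        instruction=seed.instruction.format(path=path),
        setup_cmds=[c.format(path=path) for c in seed.setup_cmds],
        check_cmd=seed.check_cmd.format(path=path),
    )

rng = random.Random(0)
tasks = [expand_seed(SEEDS[0], rng) for _ in range(3)]
```

The key property this sketch preserves is that every generated task carries its own setup and verification commands, so success can be checked by exit status without human labels.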
Related papers
- TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents [70.68963723787424]
TermiGen is an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. Our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench.
arXiv Detail & Related papers (2026-02-06T23:56:50Z)
- AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents [75.67445299298949]
AgentCPM-Explore is a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward-signal denoising, and contextual information refinement. AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet and DeepSeek-v3.2 on five benchmarks.
arXiv Detail & Related papers (2026-02-06T08:24:59Z)
- Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments [36.81059045059001]
Training agentic models for terminal-based tasks depends on high-quality terminal trajectories that capture realistic long-horizon interactions. We propose TerminalTraj, a scalable pipeline that generates Docker-aligned task instances and synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains.
arXiv Detail & Related papers (2026-02-01T14:09:23Z)
- Endless Terminals: Scaling RL Environments for Terminal Agents [39.60665149203152]
Endless Terminals is a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop. These improvements transfer to human-curated benchmarks.
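The summary mentions vanilla PPO with binary episode-level rewards, but gives no formulation. A minimal sketch of what that reward scheme and the standard PPO clipped objective look like (helper names, `gamma`, and the clipping constant `eps` are assumptions, not taken from the paper):

```python
import math

def episode_returns(num_steps: int, success: bool,
                    gamma: float = 1.0) -> list[float]:
    """Binary episode-level reward: 1.0 if the task's verification
    passed, else 0.0, credited back over all steps. With gamma == 1.0
    (undiscounted), every step receives the same return."""
    reward = 1.0 if success else 0.0
    return [reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]

def ppo_clip_loss(logp_new: float, logp_old: float,
                  advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate loss for a single action:
    -min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the
    probability ratio between the new and old policies."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio * advantage, clipped * advantage)
```

With a binary terminal signal, every action in a successful episode is reinforced equally; the clipping keeps any single update from moving the policy too far from the one that collected the data.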
arXiv Detail & Related papers (2026-01-23T04:39:55Z)
- Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation [65.3648667980258]
Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL). We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA.
arXiv Detail & Related papers (2025-09-28T13:19:20Z)
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z)
- Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs [19.766885088032932]
Software engineering (SWE) has emerged as a crucial testbed for next-generation LLM agents. Most existing datasets are limited to only a few thousand GitHub-sourced instances. We propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
arXiv Detail & Related papers (2025-06-24T03:53:36Z)
- A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices [18.853357902416832]
Current on-device model training is hampered by low training throughput, limited storage, and diverse data importance. We propose a two-stage data selection framework, Titan, to select the most important data batch from streaming data for model training. Titan achieves up to 43% reduction in training time and a 6.2% increase in final accuracy with minor system overhead.
arXiv Detail & Related papers (2025-05-22T11:53:48Z)
- NeMo-Inspector: A Visualization Tool for LLM Generation Analysis [6.55530159050218]
We introduce NeMo-Inspector, an open-source tool designed to simplify the analysis of synthetic datasets with integrated inference capabilities. We demonstrate its effectiveness through two real-world cases.
arXiv Detail & Related papers (2025-05-01T22:47:06Z)
- Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [68.94373533768501]
We model knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
- Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for vision-and-language navigation learning. We use 1,200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs. Thanks to this large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute over the previous SoTA) to a new best of 80% single-run success rate on the R2R test split by simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.