LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
- URL: http://arxiv.org/abs/2602.14337v1
- Date: Sun, 15 Feb 2026 23:12:57 GMT
- Title: LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
- Authors: Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang
- Abstract summary: LongCLI-Bench is a benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world tasks. Experiments reveal that even state-of-the-art agents achieve pass rates below 20% in LongCLI-Bench.
- Score: 65.11019654023978
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities on long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from-scratch development, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields substantially larger improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome the key challenges of long-horizon task performance.
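The dual-set testing protocol and step-level scoring described in the abstract can be pictured with a small sketch. The class names, field names, and aggregation rules below (TaskResult, fail_to_pass, pass_to_pass, step counts) are illustrative assumptions rather than LongCLI-Bench's actual evaluation harness; the sketch only shows the idea that a task counts as solved when all new requirement tests pass and no pre-existing test regresses, while step-level progress is tracked separately.

```python
from dataclasses import dataclass

# Illustrative sketch of a dual-set scoring routine; names and rules
# are assumptions, not the benchmark's real harness.

@dataclass
class TaskResult:
    fail_to_pass: dict[str, bool]   # tests that must flip from failing to passing
    pass_to_pass: dict[str, bool]   # pre-existing tests that must keep passing
    steps_completed: int            # step-level progress reported by the grader
    steps_total: int

def task_passes(r: TaskResult) -> bool:
    """A task counts as solved only if every fail-to-pass test now passes
    and no pass-to-pass test regressed."""
    return all(r.fail_to_pass.values()) and all(r.pass_to_pass.values())

def step_score(r: TaskResult) -> float:
    """Fraction of the task's annotated steps the agent completed."""
    return r.steps_completed / r.steps_total if r.steps_total else 0.0

def pass_rate(results: list[TaskResult]) -> float:
    """Overall pass rate across all benchmark tasks."""
    return sum(task_passes(r) for r in results) / len(results)

# Example: an agent that satisfied the new requirement but broke a
# regression test gets partial step credit, not a pass.
example = TaskResult(
    fail_to_pass={"test_new_feature": True},
    pass_to_pass={"test_existing_api": False},
    steps_completed=3,
    steps_total=12,
)
assert not task_passes(example)
assert abs(step_score(example) - 0.25) < 1e-9
```

Separating the binary pass criterion from the step-level fraction is what lets the paper report both headline pass rates and the observation that most failed runs stall below 30% completion.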
Related papers
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments inspired by real-world problems. We show that frontier models and agents score less than 65% on the benchmark. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z)
- NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents [79.29376673236142]
Existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. We present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents.
arXiv Detail & Related papers (2025-12-14T15:12:13Z)
- Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation [25.0921056409982]
Single-agent GUI agents struggle to balance high-level capabilities and low-level execution capabilities. Unlike training a unified policy model, we focus on training high-level scheduling models. We build the Coordinator-Executor-State Tracker framework, which can be integrated with any low-level Executor model.
arXiv Detail & Related papers (2025-11-27T09:01:38Z)
- LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
- Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation [57.12284831164602]
Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation.
arXiv Detail & Related papers (2025-11-15T15:22:42Z)
- UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios [63.67884284105684]
We introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. Agents are evaluated in long-horizon discovery tasks where they must iteratively uncover hidden rules. Our experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores.
arXiv Detail & Related papers (2025-09-26T02:04:00Z)
- VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots [44.99833362998488]
We propose an architecture for automatically verifying high-level task plans before their execution in simulated or real-world environments. The module uses the reasoning capabilities of Large Language Models to evaluate logical coherence and identify potential gaps in the plan. We contribute to improving the reliability and efficiency of task planning and address the critical need for robust pre-execution verification in autonomous systems.
arXiv Detail & Related papers (2025-07-07T15:31:36Z)
- GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents [33.71705923246233]
GSO is a benchmark for evaluating language models' capabilities in developing high-performance software. SWE-Agents struggle significantly, achieving less than a 5% success rate, with limited improvements even with inference-time scaling. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
arXiv Detail & Related papers (2025-05-29T17:14:55Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)