OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
- URL: http://arxiv.org/abs/2601.10343v2
- Date: Fri, 16 Jan 2026 02:10:35 GMT
- Title: OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
- Authors: Deming Ding, Shichun Liu, Enhui Yang, Jiahang Lin, Ziying Chen, Shihan Dou, Honglin Guo, Weiyu Cheng, Pengyu Zhao, Chengjun Xiao, Qunhong Zeng, Qi Zhang, Xuanjing Huang, Qidi Xu, Tao Gui
- Abstract summary: We introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. Experiments reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly target heterogeneous instruction following.
- Score: 57.39403818250357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.
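To make the evaluation setup concrete, below is a minimal, hedged sketch of how checklist-based compliance scoring over a captured trajectory could look. The data structures (Trajectory, ChecklistItem), field names, and example constraints are illustrative assumptions rather than the OctoBench toolkit's actual schema; the point is only that task success and scaffold compliance are reported as separate signals.

```python
# Minimal sketch (not the OctoBench toolkit): scoring one recorded agent
# trajectory against a checklist of scaffold constraints, so that task
# success and rule compliance are reported separately. All names and
# example constraints below are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """A captured record of what the agent did during one task."""
    edited_files: List[str] = field(default_factory=list)
    shell_commands: List[str] = field(default_factory=list)
    task_tests_passed: bool = False


@dataclass
class ChecklistItem:
    """One objective, automatically checkable scaffold constraint."""
    description: str
    check: Callable[[Trajectory], bool]


def score(traj: Trajectory, checklist: List[ChecklistItem]) -> dict:
    """Return task success and checklist compliance as separate numbers."""
    results = [(item.description, item.check(traj)) for item in checklist]
    passed = sum(1 for _, ok in results if ok)
    return {
        "task_solved": traj.task_tests_passed,
        "compliance": passed / len(checklist) if checklist else 1.0,
        "violations": [desc for desc, ok in results if not ok],
    }


if __name__ == "__main__":
    # Hypothetical constraints that persist across the whole interaction.
    checklist = [
        ChecklistItem("never edit files under vendor/",
                      lambda t: not any(f.startswith("vendor/") for f in t.edited_files)),
        ChecklistItem("run the linter before finishing",
                      lambda t: any("lint" in cmd for cmd in t.shell_commands)),
    ]
    traj = Trajectory(edited_files=["src/app.py"],
                      shell_commands=["pytest -q", "make lint"],
                      task_tests_passed=True)
    print(score(traj, checklist))  # task solved and both constraints satisfied
```

In the paper's terms, the 7,098 objective checklist items would play the role of such automatically checkable predicates, while the toolkit's captured trajectories supply the evidence they are checked against.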
Related papers
- RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing [1.4069797812477614]
We introduce a benchmarking framework for repository-level code modernization built on an implementation-agnostic evaluation paradigm. RepoMod-Bench is a benchmark of 21 real-world repositories with standardized interfaces, spanning 8 languages. The benchmark contains 1.6M lines of code (LOC) and 11,616 tests, with repository sizes ranging from 14 to 211K LOC.
arXiv Detail & Related papers (2026-02-26T01:25:00Z)
- CodeCompass: Navigating the Navigation Paradox in Agentic Code Intelligence [0.0]
We identify the Navigation Paradox: agents perform poorly because navigation and retrieval are fundamentally distinct problems. We demonstrate that graph-based structural navigation via Code (a Model Context Protocol server exposing dependency graphs) achieves 99.4% task completion on hidden-dependency tasks.
arXiv Detail & Related papers (2026-02-23T16:58:37Z)
- FeatureBench: Benchmarking Agentic Coding for Complex Feature Development [42.26354337364403]
FeatureBench is a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. It incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. Empirical evaluation reveals that a state-of-the-art agentic model, Claude 4.5 Opus, achieves a 74.4% resolved rate on SWE-bench.
arXiv Detail & Related papers (2026-02-11T16:06:32Z)
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments inspired by real-world problems. We show that frontier models and agents score less than 65% on the benchmark. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z)
- RepoSummary: Feature-Oriented Summarization and Documentation Generation for Code Repositories [7.744086870383438]
RepoSummary is a feature-oriented code repository summarization approach that simultaneously generates repository documentation automatically and establishes more accurate traceability links from functional features to the corresponding code elements.
arXiv Detail & Related papers (2025-10-13T06:16:44Z)
- GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging [41.754784344572286]
We release GitTaskBench, a benchmark for evaluating code agents in real-world scenarios. Each task pairs a relevant repository with an automated, human-curated evaluation harness. We also propose the alpha-value metric to quantify the economic benefit of agent performance.
arXiv Detail & Related papers (2025-08-26T12:48:05Z)
- Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles [1.387448620257867]
Large Language Models (LLMs) have shown strong capabilities in code generation and comprehension, yet their application to complex software engineering tasks often suffers from low precision and limited interpretability. We present Repeton, a fully open-source framework that leverages LLMs for precise and automated code manipulation in real-world Git repositories.
arXiv Detail & Related papers (2025-06-09T19:36:40Z)
- AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. AgentIF is the first benchmark for systematically evaluating LLM instruction-following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z)
- SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage; a rough sketch of this kind of call-chain tracing follows this entry.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
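Below is a minimal, hedged sketch of extracting a dynamic calling chain from an existing unit test, assuming Python's built-in sys.settrace hook. The helper record_call_chain and the toy functions under test are invented for illustration and are not Codev-Agent's actual implementation.

```python
# Rough illustration (not Codev-Agent): record the dynamic calling chain that
# one existing test exercises, using Python's built-in tracing hook.
# The toy functions under test are invented for this example.
import sys
from typing import Callable, List


def record_call_chain(test_fn: Callable[[], None]) -> List[str]:
    """Run a test under a trace function and return the Python-level calls it triggers, in order."""
    chain: List[str] = []

    def tracer(frame, event, arg):
        if event == "call":
            chain.append(frame.f_code.co_name)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(previous)
    return chain


# --- toy code under test ---
def normalize(name: str) -> str:
    return name.strip().lower()


def greet(name: str) -> str:
    return "hello, " + normalize(name)


def test_greet():
    assert greet("  Ada ") == "hello, ada"


if __name__ == "__main__":
    print(record_call_chain(test_greet))  # ['test_greet', 'greet', 'normalize']
```

A harness could then, for example, use such chains to locate the repository functions a test actually exercises before deriving new test samples.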