Theory of Code Space: Do Code Agents Understand Software Architecture?
- URL: http://arxiv.org/abs/2603.00601v2
- Date: Tue, 03 Mar 2026 18:45:08 GMT
- Title: Theory of Code Space: Do Code Agents Understand Software Architecture?
- Authors: Grigory Sapunov
- Abstract summary: Theory of Code Space (ToCS) is a benchmark that evaluates an agent's ability to construct, maintain, and update coherent architectural beliefs. ToCS requires agents to build structured belief states over module dependencies. Architectural Constraint Discovery is introduced as a code-specific evaluation dimension.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: AI code agents excel at isolated tasks yet struggle with complex, multi-file software engineering that requires understanding how dozens of modules relate. We hypothesize these failures stem from an inability to construct, maintain, and update coherent architectural beliefs during codebase exploration. We introduce Theory of Code Space (ToCS), a benchmark that evaluates this capability by placing agents in procedurally generated codebases under partial observability, requiring them to build structured belief states over module dependencies, cross-cutting invariants, and design intent. The framework features: (1) a procedural codebase generator producing medium-complexity Python projects with four typed edge categories reflecting different discovery methods -- from syntactic imports to config-driven dynamic wiring -- with planted architectural constraints and verified ground truth; (2) a partial observability harness where agents explore under a budget; and (3) periodic belief probing via structured JSON, producing a time series of architectural understanding. We decompose the Active-Passive Gap from spatial reasoning benchmarks into selection and decision components, and introduce Architectural Constraint Discovery as a code-specific evaluation dimension. Preliminary experiments with four rule-based baselines and five frontier LLM agents from three providers validate the benchmark's discriminative power: methods span a wide performance range (F1 from 0.129 to 0.646), LLM agents discover semantic edge types invisible to all baselines, yet weaker models score below simple heuristics -- revealing that belief externalization, faithfully serializing internal understanding into structured JSON, is itself a non-trivial capability and a first-order confounder in belief-probing benchmarks. Open-source toolkit: https://github.com/che-shr-cat/tocs
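As a rough illustration of the belief-probing idea described above, the following minimal Python sketch shows a serialized belief snapshot over typed dependency edges scored with F1 against planted ground truth. The JSON field names, module names, edge-type labels, and scoring rule are illustrative assumptions, not the actual schema or metric implementation of the ToCS toolkit.

# Hypothetical sketch of ToCS-style belief probing and scoring.
# Field names, edge-type labels, and the exact-match scoring rule are
# assumptions for illustration, not the real tocs schema.
import json

# Ground-truth typed dependency edges planted by the codebase generator:
# (source module, target module, edge type).
ground_truth = {
    ("app.api", "app.db", "import"),
    ("app.api", "app.auth", "import"),
    ("app.worker", "app.tasks", "dynamic"),   # config-driven wiring
    ("app.db", "app.models", "import"),
}

# A belief snapshot an agent might emit when probed mid-exploration.
belief_json = """
{
  "edges": [
    {"src": "app.api",    "dst": "app.db",    "type": "import"},
    {"src": "app.api",    "dst": "app.auth",  "type": "import"},
    {"src": "app.worker", "dst": "app.tasks", "type": "import"}
  ]
}
"""

def edge_f1(belief: str, truth: set) -> float:
    """Score a serialized belief state against ground truth using exact
    (src, dst, type) matching -- one plausible way to obtain an F1 number."""
    edges = json.loads(belief)["edges"]
    predicted = {(e["src"], e["dst"], e["type"]) for e in edges}
    if not predicted or not truth:
        return 0.0
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

print(f"belief F1 = {edge_f1(belief_json, ground_truth):.3f}")  # 0.571 here

In this toy example the agent misses one edge entirely and mislabels the config-driven edge as a syntactic import, so it scores roughly 0.57 rather than 1.0, which is the kind of gap a time series of probes would track during exploration.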
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition. It shifts away from linear patching by generating multiple diverse implementation designs. Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes a new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z)
- Multi-CoLoR: Context-Aware Localization and Reasoning across Multi-Language Codebases [1.4216413758677147]
We present Multi-CoLoR, a framework for context-aware localization and reasoning across multi-language codebases. It integrates organizational knowledge retrieval with graph-based reasoning to traverse complex software ecosystems.
arXiv Detail & Related papers (2026-02-23T00:54:59Z)
- Do Not Treat Code as Natural Language: Implications for Repository-Level Code Generation and Beyond [13.550121154853715]
We present Hydra, a repository-level code generation framework that treats code as structured code rather than natural language. We show that Hydra achieves state-of-the-art performance across open- and closed-source CodeLLMs.
arXiv Detail & Related papers (2026-02-12T07:44:00Z)
- DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task [9.51787137194505]
DDL2PropBank is a novel benchmark task that maps relational database schemas to PropBank rolesets. We implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead.
arXiv Detail & Related papers (2026-02-03T01:10:59Z)
- ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development [49.63491095660809]
ProjDevBench is an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality but struggle with complex system design, time optimization, and resource management.
arXiv Detail & Related papers (2026-02-02T05:17:23Z)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
- Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers [103.4410890572479]
We introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification. LoongBench is a curated seed dataset containing 8,729 human-vetted examples across 12 domains. LoongEnv is a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples.
arXiv Detail & Related papers (2025-09-03T06:42:40Z)
- VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use [78.29315418819074]
We introduce VerlTool, a unified and modular framework that addresses limitations through systematic design principles. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions.
arXiv Detail & Related papers (2025-09-01T01:45:18Z)
- Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z)
- LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning [15.20947984949809]
We propose ColaUntangle, a new collaborative consultation framework for commit untangling. ColaUntangle integrates Large Language Model (LLM)-driven agents in a multi-agent architecture. We evaluate ColaUntangle on two widely-used datasets (1,612 C# and 14k Java tangled commits).
arXiv Detail & Related papers (2025-07-22T09:42:13Z)
- Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures [21.390740746718947]
We introduce DSR-Bench, the first benchmark to systematically evaluate large language models' structural reasoning. The benchmark spans 20 data structures, 35 operations, and 4,140 synthetically generated problem instances with minimal contamination.
arXiv Detail & Related papers (2025-05-29T23:24:53Z)
- EpiCoder: Encompassing Diversity and Complexity in Code Generation [66.43738008739555]
Existing methods for code generation use code snippets as seed data. We introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features. Our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios.
arXiv Detail & Related papers (2025-01-08T18:58:15Z)
- CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases [13.733229886643041]
Large Language Models (LLMs) excel in stand-alone code tasks like HumanEval and MBPP, but struggle with handling entire code repositories.
Similarity-based retrieval often has low recall in complex tasks, while manual tools and APIs are typically task-specific and require expert knowledge.
We introduce CodexGraph, a system that integrates LLM agents with graph database interfaces extracted from code repositories.
arXiv Detail & Related papers (2024-08-07T17:13:59Z)