Related papers: Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale

URL: http://arxiv.org/abs/2602.05447v2
Date: Thu, 12 Feb 2026 12:19:22 GMT
Title: Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale
Authors: Damon McMillan
Abstract summary: Large Language Model agents increasingly operate systems through programmatic interfaces. Yet practitioners lack empirical guidance on how to structure the context these agents consume. We study 9,649 experiments across 11 models, 4 formats, and schemas ranging from 10 to 10,000 tables.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open source, exhibit format-specific sensitivities. Third, model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source tiers that dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact or novel formats can incur a token overhead driven by grep output density and pattern unfamiliarity, with the magnitude depending on model capability. These findings provide practitioners with evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.
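
The reported format result (chi-squared=2.45, p=0.484) can be sanity-checked with a short calculation. The sketch below assumes the test had 3 degrees of freedom, as for a 4-format by correct/incorrect contingency table; the abstract does not state the degrees of freedom, so that value is an assumption, and the function name chi2_sf_df3 is purely illustrative. It uses the closed-form survival function of a chi-squared distribution with 3 degrees of freedom and needs only the standard library.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """P(X > x) for a chi-squared variable with 3 degrees of freedom.

    Closed form for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Test statistic as reported in the abstract.
statistic = 2.45

# ASSUMPTION: df = (4 formats - 1) * (2 outcomes - 1) = 3; not stated in the abstract.
p_value = chi2_sf_df3(statistic)

print(f"p = {p_value:.3f}")
```

Under that assumption the computed value lands close to the reported p=0.484, which is consistent with a test across the four formats on a binary correct/incorrect outcome.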

Related papers

Model Specific Task Similarity for Vision Language Model Selection via Layer Conductance [92.72779885657373]
We propose a framework that grounds model selection in the internal functional dynamics of the visual encoder. Our approach represents each task via layer wise conductance and derives a target-conditioned block importance distribution through entropy regularized alignment. Building on this, we introduce Directional Conductance Divergence (DCD), an asymmetric metric that quantifies how effectively a source task covers the target's salient functional blocks.
arXiv Detail & Related papers (2026-02-01T17:29:43Z)
OFA-MAS: One-for-All Multi-Agent System Topology Design based on Mixture-of-Experts Graph Generative Models [57.94189874119267]
Multi-Agent Systems (MAS) offer a powerful paradigm for solving complex problems. Current graph learning-based design methodologies often adhere to a "one-for-one" paradigm. We propose OFA-TAD, a one-for-all framework that generates adaptive collaboration graphs for any task described in natural language.
arXiv Detail & Related papers (2026-01-19T12:23:44Z)
HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions [50.61510609116118]
HuggingR$^{4}$ is a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to efficiently select models. It attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing existing methods by 26.51% and 33.25%, respectively.
arXiv Detail & Related papers (2025-11-24T03:13:45Z)
PublicAgent: Multi-Agent Design Principles From an LLM-Based Open Data Analysis Framework [5.863391019411233]
Large language models show promise for individual tasks, but end-to-end analytical workflows expose fundamental limitations. We present PublicAgent, a multi-agent framework that addresses these limitations through decomposition into specialized agents for intent clarification, dataset discovery, analysis, and reporting.
arXiv Detail & Related papers (2025-11-04T21:48:11Z)
Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
We show that the decoding mechanism of dLLMs can be used as a powerful tool for model attribution. We propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors.
arXiv Detail & Related papers (2025-10-02T06:25:10Z)
LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence [61.46575527504109]
LimiX-16M and LimiX-2M treat structured data as a joint distribution over variables and missingness. We evaluate LimiX models across 11 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios.
arXiv Detail & Related papers (2025-09-03T17:39:08Z)
Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks [3.3705400036304205]
"Semantic drift" compromises data and governance, and impairs the utility of services like text-to-RAG. This paper proposes a novel framework for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. Result: A 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting.
arXiv Detail & Related papers (2025-08-10T05:04:32Z)
AI-assisted JSON Schema Creation and Mapping [0.0]
We present a hybrid approach that combines large language models (LLMs) with deterministic techniques to enable creation, modification, and schema mapping based on natural language inputs by the user. This work significantly lowers the barrier to structured data modeling and data integration for non-experts.
arXiv Detail & Related papers (2025-08-07T09:27:10Z)
SLOT: Structuring the Output of Large Language Models [5.683327173793259]
We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats. Our results demonstrate that a fine-tuned Mistral-7B model with constrained decoding achieves near-perfect schema accuracy. Notably, even compact models like Llama-3.2-1B can match or exceed the structured output capabilities of much larger proprietary models.
arXiv Detail & Related papers (2025-05-06T23:29:43Z)
Why Personalizing Deep Learning-Based Code Completion Tools Matters [55.39571645315926]
We consider 136 developers from two organizations (Apache and Spring), two model architectures (T5 and Code Llama), and three model sizes (60M, 750M, and 7B trainable parameters). For the Code Llama model (7B), we compared the performance of the already pre-trained model publicly available online with the same model fine-tuned on organization- and developer-specific datasets. Our results show that there is a boost in prediction capabilities provided by both an organization-specific and a developer-specific additional fine-tuning.
arXiv Detail & Related papers (2025-03-18T12:26:06Z)
Neural Production Systems [90.75211413357577]
Visual environments are structured, consisting of distinct objects or entities. To partition images into entities, deep-learning researchers have proposed structural inductive biases. We take inspiration from cognitive science and resurrect a classic approach, which consists of a set of rule templates. This architecture achieves a flexible, dynamic flow of control and serves to factorize entity-specific and rule-based information.
arXiv Detail & Related papers (2021-03-02T18:53:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.