Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?
- URL: http://arxiv.org/abs/2602.11988v1
- Date: Thu, 12 Feb 2026 14:15:22 GMT
- Title: Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?
- Authors: Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev
- Abstract summary: We study whether context files are effective for real-world tasks. We find that context files tend to reduce task success rates compared to providing no repository context. We conclude that unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.
- Score: 3.261
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, by either manually or automatically generating them. Although this practice is strongly encouraged by agent developers, there is currently no rigorous investigation into whether such context files are actually effective for real-world tasks. In this work, we study this question and evaluate coding agents' task completion performance in two complementary settings: established SWE-bench tasks from popular repositories, with LLM-generated context files following agent-developer recommendations, and a novel collection of issues from repositories containing developer-committed context files. Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%. Behaviorally, both LLM-generated and developer-provided context files encourage broader exploration (e.g., more thorough testing and file traversal), and coding agents tend to respect their instructions. Ultimately, we conclude that unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.
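To make the paper's conclusion concrete, here is a hedged sketch of what a "minimal requirements only" AGENTS.md might look like; the project commands and rules below are hypothetical illustrations, not examples taken from the paper.

```markdown
<!-- Hypothetical minimal AGENTS.md; all commands and rules are illustrative. -->
# AGENTS.md

## Build and test
- Install: `pip install -e ".[dev]"`
- Test: `pytest tests/`

## Hard requirements
- Do not change public function signatures.
- Run `ruff check .` before finishing a task.
```

The point, per the paper's findings, is what such a file omits: style preferences, broad exploration instructions, and other non-essential requirements, which were observed to lower success rates while raising inference cost.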
Related papers
- FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents [53.03492387564392]
We introduce FS-Researcher, a file-system-based framework that scales deep research beyond the context window via a persistent workspace. A Context Builder agent browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts.
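A rough Python sketch of this file-system-backed design is below; the directory layout and function names are assumptions for illustration, not FS-Researcher's actual implementation.

```python
from pathlib import Path

# Hypothetical sketch of a persistent, file-system-backed research workspace
# in the spirit of FS-Researcher; layout and names are assumptions.
WORKSPACE = Path("workspace")

def archive_source(topic: str, name: str, raw_text: str) -> Path:
    """Context Builder step: persist a raw source under a topic directory,
    so the knowledge base can grow beyond the model's context window."""
    path = WORKSPACE / "sources" / topic / f"{name}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(raw_text)
    return path

def write_note(topic: str, summary: str) -> None:
    """Context Builder step: append a structured note for later retrieval."""
    note = WORKSPACE / "notes" / f"{topic}.md"
    note.parent.mkdir(parents=True, exist_ok=True)
    with note.open("a") as f:
        f.write(summary.rstrip() + "\n\n")

def compose_report(topics: list[str]) -> str:
    """Report Writer step: draft the report section by section, treating
    the notes on disk (not the context window) as the source of facts."""
    sections = []
    for topic in topics:
        notes = (WORKSPACE / "notes" / f"{topic}.md").read_text()
        sections.append(f"## {topic}\n\n{notes}")
    return "\n".join(sections)

archive_source("agents-md", "paper1", "raw abstract text ...")
write_note("agents-md", "- Context files can reduce task success rates.")
print(compose_report(["agents-md"]))
```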
arXiv Detail & Related papers (2026-02-02T03:00:19Z) - AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion [55.21541958868449]
We propose AlignCoder, a repository-level code completion framework. Our framework generates an enhanced query that bridges the semantic gap between the initial query and the target code. We employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval.
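The retrieve-then-complete loop this describes can be sketched as follows; the enhancement and scoring functions are simple stand-ins (AlignCoder uses an LLM for query enhancement and an RL-trained AlignRetriever for retrieval).

```python
from collections import Counter
import math

def enhance_query(initial_query: str) -> str:
    """Stand-in for LLM-based query enhancement; in AlignCoder this step
    bridges the semantic gap between the query and the target code."""
    return initial_query + " helper function implementation"

def cosine(a: Counter, b: Counter) -> float:
    """Bag-of-words cosine similarity, a toy stand-in for a trained retriever."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, snippets: list[str], k: int = 2) -> list[str]:
    """Rank repository snippets against the enhanced query."""
    q = Counter(query.split())
    ranked = sorted(snippets, key=lambda s: cosine(q, Counter(s.split())), reverse=True)
    return ranked[:k]

snippets = [
    "def load_settings(path): # read settings from file",
    "def train_model(data): # fit the model",
]
print(retrieve(enhance_query("load settings from file"), snippets, k=1))
```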
arXiv Detail & Related papers (2026-01-27T15:23:14Z) - Agent READMEs: An Empirical Study of Context Files for Agentic Coding [8.019313057979522]
We study 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.
arXiv Detail & Related papers (2025-11-17T02:18:55Z) - Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code [0.0]
Large Language Models (LLMs) have shown promise in automating code generation and software engineering tasks, yet they often struggle with complex, multi-file projects due to context limitations and knowledge gaps. We propose a novel context engineering workflow that combines multiple AI components: an Intent Translator (GPT-5) for clarifying user requirements, Elicit-powered semantic literature retrieval for injecting domain knowledge, NotebookLM-based document synthesis for contextual understanding, and a Claude Code multi-agent system for code generation and validation.
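A hedged sketch of how these four stages could be orchestrated is shown below; every function is a stub standing in for the named tool (GPT-5, Elicit, NotebookLM, Claude Code), and none of the real services is called.

```python
# Hypothetical orchestration of the four-stage workflow described above;
# each stage is a placeholder, not the paper's actual integration.

def translate_intent(user_request: str) -> str:
    """Intent Translator stage: clarify the request into a precise spec."""
    return f"Spec: {user_request.strip()}"

def retrieve_domain_knowledge(spec: str) -> list[str]:
    """Elicit-style semantic literature retrieval (stubbed)."""
    return [f"Background note relevant to: {spec}"]

def synthesize_context(spec: str, notes: list[str]) -> str:
    """NotebookLM-style document synthesis (stubbed)."""
    return spec + "\n" + "\n".join(notes)

def generate_and_validate(context: str) -> str:
    """Claude Code-style multi-agent generation and validation (stubbed)."""
    return f"# generated under context:\n# {context.splitlines()[0]}\n"

def pipeline(user_request: str) -> str:
    spec = translate_intent(user_request)
    notes = retrieve_domain_knowledge(spec)
    context = synthesize_context(spec, notes)
    return generate_and_validate(context)

print(pipeline("add retry logic to the HTTP client"))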
arXiv Detail & Related papers (2025-08-09T14:45:53Z) - Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
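One piece of such a pipeline, extracting a dynamic calling chain by running a unit test under a tracer, can be sketched in a few lines of Python; the tracer below is an illustrative assumption, not Codev-Agent's actual tooling.

```python
import sys

# Illustrative sketch of one Codev-Agent-style step: run a unit test and
# record which functions it enters, approximating a dynamic calling chain.
# All function names here are hypothetical.

def trace_calls(test_fn):
    """Execute test_fn under a global tracer and return the sequence of
    function names entered during the run."""
    chain = []

    def tracer(frame, event, arg):
        if event == "call":
            chain.append(frame.f_code.co_name)
        return None  # no per-line tracing needed

    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(None)
    return chain

def helper():
    return 1

def test_example():
    assert helper() == 1

print(trace_calls(test_example))  # ['test_example', 'helper']
```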
arXiv Detail & Related papers (2024-10-02T09:11:10Z) - Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs [24.00351065427465]
We propose a strategy named Hierarchical Context Pruning (HCP) to construct completion prompts containing highly informative code content.
The HCP models the code repository at the function level, maintaining the topological dependencies between code files while removing a large amount of irrelevant code content.
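A toy version of this pruning step is sketched below: the repository is modeled as a function-level dependency graph, and only functions within a few hops of the completion target are kept. The graph shape and the depth cutoff are illustrative assumptions, not HCP's exact relevance criterion.

```python
from collections import deque

def prune(deps: dict[str, list[str]], target: str, max_depth: int = 2) -> set[str]:
    """Keep only functions within max_depth dependency hops of the target,
    preserving topological relations while discarding irrelevant code."""
    kept, frontier = {target}, deque([(target, 0)])
    while frontier:
        fn, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for callee in deps.get(fn, []):
            if callee not in kept:
                kept.add(callee)
                frontier.append((callee, depth + 1))
    return kept

# Example: everything outside the 2-hop neighborhood of `target` is pruned
# from the completion prompt.
deps = {"target": ["util_a"], "util_a": ["util_b"], "util_b": ["util_c"], "other": []}
print(prune(deps, "target"))  # {'target', 'util_a', 'util_b'}
```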
arXiv Detail & Related papers (2024-06-26T12:26:16Z) - On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present RepoExec, a novel benchmark designed to evaluate repository-level code generation. We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
arXiv Detail & Related papers (2024-06-17T10:45:22Z) - On The Importance of Reasoning for Context Retrieval in Repository-Level Code Editing [82.96523584351314]
We decouple the task of context retrieval from the other components of repository-level code editing pipelines.
We conclude that while the reasoning helps to improve the precision of the gathered context, it still lacks the ability to identify its sufficiency.
arXiv Detail & Related papers (2024-06-06T19:44:17Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context [82.88371379927112]
We propose a framework that incorporates cross-file context to learn the in-file and cross-file context jointly on top of pretrained code LMs.
CoCoMIC successfully improves the existing code LM with a 33.94% relative increase in exact match and a 28.69% relative increase in identifier matching for code completion when the cross-file context is provided.
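As a rough illustration of the prompt layout this implies (not CoCoMIC's actual architecture, which trains the model to attend over both contexts jointly), cross-file snippets can be prepended to the in-file prefix under a size budget; the tags and truncation policy below are assumptions.

```python
def build_prompt(cross_file: list[str], in_file_prefix: str, budget: int = 2048) -> str:
    """Concatenate retrieved cross-file snippets and the in-file prefix,
    dropping cross-file context first when the budget is exceeded
    (word count serves as a crude token proxy)."""
    parts = [f"# cross-file context\n{snippet}" for snippet in cross_file]
    prompt = "\n\n".join(parts + [f"# current file\n{in_file_prefix}"])
    while len(prompt.split()) > budget and parts:
        parts.pop(0)  # drop the least relevant snippet (assumed order)
        prompt = "\n\n".join(parts + [f"# current file\n{in_file_prefix}"])
    return prompt

print(build_prompt(["def util(): ..."], "def complete_me("))
```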
arXiv Detail & Related papers (2022-12-20T05:48:09Z)