ContextBench: A Benchmark for Context Retrieval in Coding Agents
- URL: http://arxiv.org/abs/2602.05892v3
- Date: Wed, 11 Feb 2026 04:58:49 GMT
- Title: ContextBench: A Benchmark for Context Retrieval in Coding Agents
- Authors: Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, He Ye,
- Abstract summary: We introduce ContextBench, a process-oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue-resolution tasks from 66 repositories across eight programming languages.
- Score: 26.158308735620405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during problem solving. We introduce ContextBench, a process-oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue-resolution tasks from 66 repositories across eight programming languages, each augmented with human-annotated gold contexts. We further implement an automated evaluation framework that tracks agent trajectories and measures context recall, precision, and efficiency throughout issue resolution. Using ContextBench, we evaluate four frontier LLMs and five coding agents. Our results show that sophisticated agent scaffolding yields only marginal gains in context retrieval ("The Bitter Lesson" of coding agents), LLMs consistently favor recall over precision, and substantial gaps exist between explored and utilized context. ContextBench augments existing end-to-end benchmarks with intermediate gold-context metrics that unbox the issue-resolution process. These contexts offer valuable intermediate signals for guiding LLM reasoning in software tasks.
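To make the intermediate metrics concrete, the sketch below shows one way file-level context recall, precision, and a simple efficiency ratio could be computed from an agent trajectory against gold contexts. It is a minimal illustration under assumed definitions (file-level granularity, efficiency as final recall per step), not the paper's published evaluation code.

```python
# Minimal sketch of trajectory-level context metrics, assuming gold and
# retrieved contexts are sets of file paths; ContextBench's actual
# granularity and efficiency definition may differ.

def context_recall(retrieved: set[str], gold: set[str]) -> float:
    """Fraction of gold-context files the agent actually retrieved."""
    return len(retrieved & gold) / len(gold) if gold else 1.0

def context_precision(retrieved: set[str], gold: set[str]) -> float:
    """Fraction of retrieved files that belong to the gold context."""
    return len(retrieved & gold) / len(retrieved) if retrieved else 0.0

def retrieval_efficiency(trajectory: list[set[str]], gold: set[str]) -> float:
    """Assumed efficiency notion: final recall divided by steps taken.

    `trajectory[i]` is the cumulative set of files seen after step i.
    """
    if not trajectory:
        return 0.0
    return context_recall(trajectory[-1], gold) / len(trajectory)

# An agent that over-retrieves reaches full recall at reduced precision,
# mirroring the paper's recall-over-precision finding.
gold = {"src/parser.py", "src/lexer.py"}
steps = [{"src/parser.py"}, {"src/parser.py", "README.md", "src/lexer.py"}]
print(context_recall(steps[-1], gold))     # 1.0
print(context_precision(steps[-1], gold))  # 0.666...
print(retrieval_efficiency(steps, gold))   # 0.5
```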
Related papers
- The Limits of Long-Context Reasoning in Automated Bug Fixing [4.853967615615349]
Large language models (LLMs) can directly reason over entire contexts. Recent advances in LLMs have enabled strong performance on software engineering benchmarks. We systematically evaluate whether current LLMs can reliably perform long-context code and patch generation.
arXiv Detail & Related papers (2026-02-17T22:51:40Z)
- AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion [55.21541958868449]
We propose AlignCoder, a repository-level code completion framework. Our framework generates an enhanced query that bridges the semantic gap between the initial query and the target code. We employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval.
arXiv Detail & Related papers (2026-01-27T15:23:14Z)
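As a rough illustration of the enhanced-query idea, the sketch below rewrites the initial query with an LLM and then ranks repository chunks by embedding similarity to the rewritten query. `llm_complete` and `embed` are assumed helper callables; this is not AlignCoder's actual pipeline, which additionally trains the retriever with reinforcement learning.

```python
# Hypothetical two-stage retrieval: LLM query rewriting + embedding ranking.
# `llm_complete` and `embed` are placeholder callables, not AlignCoder's API.
import numpy as np

def rewrite_query(llm_complete, query: str) -> str:
    """Enrich the query with inferred target intent (names, types, roles)."""
    prompt = ("Rewrite this code-completion query so that it names the "
              f"likely target identifiers and their roles:\n{query}")
    return llm_complete(prompt)

def retrieve(embed, chunks: list[str], query: str, k: int = 5) -> list[str]:
    """Rank repository chunks by cosine similarity to the enhanced query."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    embs = [embed(c) for c in chunks]
    scores = [float(np.dot(q, e / np.linalg.norm(e))) for e in embs]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# Usage: retrieve(embed, chunks, rewrite_query(llm_complete, initial_query))
```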
- Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice [18.222990693059756]
ContextCRBench is a benchmark for fine-grained LLM evaluation in code review. It collects 153.7K issues and pull requests from top-tier repositories. It supports three evaluation scenarios aligned with the review workflow.
arXiv Detail & Related papers (2025-11-10T12:06:35Z)
- ContextNav: Towards Agentic Multimodal In-Context Learning [85.05420047017513]
ContextNav is an agentic framework that integrates the scalability of automated retrieval with the quality and adaptiveness of human-like curation. It builds a resource-aware multimodal embedding pipeline, maintains a retrievable vector database, and applies agentic retrieval and structural alignment to construct noise-resilient contexts. Experimental results demonstrate that ContextNav achieves state-of-the-art performance across various datasets.
arXiv Detail & Related papers (2025-10-06T07:49:52Z)
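At its simplest, the retrievable vector database in such a pipeline is nearest-neighbor search over normalized example embeddings; the in-memory store below is an illustrative stand-in, not ContextNav's resource-aware multimodal implementation.

```python
# Toy in-memory vector store for retrieving in-context examples by
# cosine similarity; illustrative only, not ContextNav's pipeline.
import numpy as np

class VectorStore:
    def __init__(self) -> None:
        self.vectors: list[np.ndarray] = []
        self.payloads: list[str] = []

    def add(self, vector: np.ndarray, payload: str) -> None:
        # Store unit vectors so a dot product equals cosine similarity.
        self.vectors.append(vector / np.linalg.norm(vector))
        self.payloads.append(payload)

    def search(self, query: np.ndarray, k: int = 3) -> list[str]:
        q = query / np.linalg.norm(query)
        scores = np.array([v @ q for v in self.vectors])
        return [self.payloads[i] for i in np.argsort(scores)[::-1][:k]]
```

An agentic layer on top of such a store would then be responsible for what the abstract calls agentic retrieval and structural alignment, e.g. filtering and ordering the retrieved examples before they enter the prompt.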
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
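Comparing a final numerical output to a human-derived baseline suggests a tolerance-based grader along these lines; the parsing rules and the tolerance value are assumptions, not IDA-Bench's published judge.

```python
# Sketch of a numeric grader; tolerance and parsing rules are assumed.
import math

def judge_numeric(agent_output: str, baseline: float,
                  rel_tol: float = 1e-3) -> bool:
    """Pass if the agent's final number matches the human baseline
    within a relative tolerance; non-numeric answers fail outright."""
    try:
        value = float(agent_output.strip().rstrip("%"))
    except ValueError:
        return False
    return math.isclose(value, baseline, rel_tol=rel_tol)

assert judge_numeric("0.8213", 0.8215)
assert not judge_numeric("0.79", 0.8215)
```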
- Context-DPO: Aligning Language Models for Context-Faithfulness [80.62221491884353]
We propose the first alignment method specifically designed to enhance large language models' context-faithfulness. By leveraging faithful and stubborn responses to questions with provided context from ConFiQA, our Context-DPO aligns LLMs through direct preference optimization. Extensive experiments demonstrate that our Context-DPO significantly improves context-faithfulness, achieving 35% to 280% improvements on popular open-source models.
arXiv Detail & Related papers (2024-12-18T04:08:18Z)
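Context-DPO builds on direct preference optimization; the standard DPO objective over a faithful (chosen) and stubborn (rejected) response pair is shown below in PyTorch. The β value and the per-sequence log-probability inputs are generic DPO conventions, not the paper's exact training code.

```python
# Standard DPO loss over (faithful, stubborn) response pairs; a generic
# formulation, not Context-DPO's exact training code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Prefer the faithful (chosen) response over the stubborn (rejected)
    one, regularized by a frozen reference model. All inputs are summed
    per-sequence token log-probabilities."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref, chosen
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref, rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```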
- HyQE: Ranking Contexts with Hypothetical Query Embeddings [9.23634055123276]
In retrieval-augmented systems, context ranking techniques are commonly employed to reorder the retrieved contexts based on their relevance to a user query.
Large language models (LLMs) have been used for ranking contexts.
We introduce a scalable ranking framework that combines embedding similarity and LLM capabilities without requiring LLM fine-tuning.
arXiv Detail & Related papers (2024-10-20T03:15:01Z)
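The ranking idea (score each context by how close the queries it would hypothetically answer are to the actual user query) can be sketched as follows; `embed` and `llm_generate_queries` are assumed helpers, and the interpolation weight is an arbitrary illustration rather than the paper's formulation.

```python
# Sketch of hypothetical-query reranking; helper callables are assumed.
import numpy as np

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_contexts(embed, llm_generate_queries, user_query: str,
                  contexts: list[str], alpha: float = 0.5) -> list[str]:
    """Blend direct query-context similarity with similarity between the
    user query and LLM-generated hypothetical queries for each context.
    Requires no LLM fine-tuning."""
    q = embed(user_query)
    scores = []
    for ctx in contexts:
        direct = _cos(q, embed(ctx))
        # e.g. "What might a user ask that this context answers?"
        hypo_qs = llm_generate_queries(ctx)
        hypo = max((_cos(q, embed(h)) for h in hypo_qs), default=0.0)
        scores.append(alpha * direct + (1 - alpha) * hypo)
    return [contexts[i] for i in np.argsort(scores)[::-1]]
```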
- Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding [29.129035086344143]
We introduce the Long Question Coreference Adaptation (LQCA) method to enhance the performance of large language models (LLMs). This framework focuses on coreference resolution tailored to long contexts, allowing the model to identify and manage references effectively. Our code is public at https://github.com/OceannTwT/LQCA.
arXiv Detail & Related papers (2024-10-02T15:39:55Z)
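One way to see why coreference resolution helps long-context reading: once mentions are resolved to canonical entities, substituting them makes every reference explicit for the model. The toy pass below applies a precomputed mention-to-entity map; it is a stand-in illustration, not LQCA's actual partition-and-merge procedure.

```python
# Toy substitution pass using a precomputed coreference map; a stand-in
# illustration, not LQCA's partition-based resolution.
import re

def make_references_explicit(text: str, coref_map: dict[str, str]) -> str:
    """Replace each mention with its canonical entity, e.g. "she" -> "Dr. Lee",
    so later questions do not depend on long-range pronoun chains."""
    # Longest mentions first so "the doctor" wins over "doctor".
    for mention in sorted(coref_map, key=len, reverse=True):
        pattern = r"\b" + re.escape(mention) + r"\b"
        text = re.sub(pattern, coref_map[mention], text)
    return text

print(make_references_explicit(
    "Dr. Lee filed the report. Later, she amended it.",
    {"she": "Dr. Lee", "it": "the report"},
))  # -> "Dr. Lee filed the report. Later, Dr. Lee amended the report."
```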
- RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues [8.036117602566074]
External retrieval mechanisms are often employed to enhance the quality of augmented generations in dialogues. Existing benchmarks either assess LLMs' chat abilities in multi-turn dialogues or their use of retrieval for augmented responses in single-turn settings. We introduce RAD-Bench, a benchmark designed to evaluate LLMs' capabilities in multi-turn dialogues following retrievals.
arXiv Detail & Related papers (2024-09-19T08:26:45Z)
- On The Importance of Reasoning for Context Retrieval in Repository-Level Code Editing [82.96523584351314]
We decouple the task of context retrieval from the other components of the repository-level code editing pipelines.
We conclude that while reasoning helps improve the precision of the gathered context, it still lacks the ability to determine when that context is sufficient.
arXiv Detail & Related papers (2024-06-06T19:44:17Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.