ClarEval: A Benchmark for Evaluating Clarification Skills of Code Agents under Ambiguous Instructions
- URL: http://arxiv.org/abs/2603.00187v1
- Date: Fri, 27 Feb 2026 01:10:27 GMT
- Title: ClarEval: A Benchmark for Evaluating Clarification Skills of Code Agents under Ambiguous Instructions
- Authors: Jialin Li, Yuan Wu, Yi Chang
- Abstract summary: We introduce ClarEval, a framework designed to assess an agent's "Collaborative Quotient" by simulating the inherent ambiguity of human communication. To quantify this capability, we propose a metric suite led by Average Turns to Clarify (ATC) and Key Question Coverage (KQC). Our experiments on eleven state-of-the-art agents reveal a stark reality: while models like GPT-5-Coder excel at coding, they often lack the strategic communication skills required for efficient partnership.
- Score: 19.875754116636436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To integrate seamlessly into real-world software engineering, Code Agents must evolve from passive instruction followers into proactive collaborative partners. However, current evaluation paradigms predominantly reward "guessing" user intent under ideal conditions, neglecting the agent's ability to align with users through dialogue--a critical trait for collaborative intelligence. In this work, we propose a paradigm shift in evaluation to drive this transition. We introduce ClarEval, a framework designed to assess an agent's "Collaborative Quotient" by simulating the inherent ambiguity of human communication. By systematically injecting three types of realistic ambiguity (missing goals, premises, and ambiguous terminology) into standard tasks, we force agents to step out of their "generator" role and engage in requirement elicitation. To quantify this capability, we propose a metric suite led by Average Turns to Clarify (ATC) and Key Question Coverage (KQC), which measure not just the correctness of the generated code, but the efficiency and precision of the collaboration. Our experiments on eleven state-of-the-art agents reveal a stark reality: while models like GPT-5-Coder excel at coding, they often lack the strategic communication skills required for efficient partnership. ClarEval thus serves as a crucial roadmap for bridging the gap between strong coders and capable collaborators. The code is available at https://github.com/JialinLi13/ClarEval
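As an illustration only (the paper's actual scoring code lives in the linked repository), the two headline metrics could be sketched roughly as follows, assuming each evaluated dialogue records how many turns the agent needed to resolve the injected ambiguity and which annotated key questions it actually asked; the `Dialogue` record and question IDs here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Dialogue:
    """One agent/user interaction on an ambiguous task (hypothetical record)."""
    turns_to_clarify: int   # turns the agent needed before the requirement was pinned down
    asked: set[str]         # IDs of the key questions the agent actually raised
    required: set[str]      # IDs of the key questions annotated for the task

def average_turns_to_clarify(dialogues: list[Dialogue]) -> float:
    """ATC: mean number of dialogue turns spent resolving the ambiguity (lower is better)."""
    return sum(d.turns_to_clarify for d in dialogues) / len(dialogues)

def key_question_coverage(dialogues: list[Dialogue]) -> float:
    """KQC: mean fraction of annotated key questions the agent covered (higher is better)."""
    return sum(len(d.asked & d.required) / len(d.required) for d in dialogues) / len(dialogues)
```

On two dialogues where an agent clarifies in 2 and 4 turns and covers 2/2 and 1/2 key questions, this sketch yields ATC = 3.0 and KQC = 0.75, matching the abstract's framing of efficiency (ATC) versus precision (KQC) of collaboration.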
Related papers
- Towards Adaptive, Scalable, and Robust Coordination of LLM Agents: A Dynamic Ad-Hoc Networking Perspective [31.81236449944822]
RAPS is a reputation-aware publish-subscribe paradigm for adaptive, scalable, and robust coordination of LLM agents. RAPS incorporates two coherent overlays: (i) Reactive Subscription, enabling agents to dynamically refine their intents; and (ii) Bayesian Reputation, empowering each agent with a local watchdog to detect and isolate malicious peers.
arXiv Detail & Related papers (2026-02-08T15:26:02Z)
- AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios [49.90735676070039]
The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks. We propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks.
arXiv Detail & Related papers (2026-01-28T13:49:18Z)
- CooperBench: Why Coding Agents Cannot be Your Teammates Yet [44.06715229961526]
CooperBench is a benchmark of over 600 collaborative coding tasks across 12 libraries in 4 programming languages. Agents achieve on average 30% lower success rates when working together compared to performing both tasks individually. Our analysis reveals three key issues: (1) communication channels become jammed with vague, ill-timed, and inaccurate messages; (2) even with effective communication, agents deviate from their commitments; and (3) agents often hold incorrect expectations about others' plans and communication.
arXiv Detail & Related papers (2026-01-19T18:48:37Z)
- From Correctness to Collaboration: Toward a Human-Centered Framework for Evaluating AI Agent Behavior in Software Engineering [7.402388519535592]
Current benchmarks, focused on code correctness, fail to capture the nuanced, interactive behaviors essential for successful human-AI partnership. We present a foundational taxonomy of desirable agent behaviors for enterprise software engineering. We also introduce the Context-Adaptive Behavior (CAB) Framework.
arXiv Detail & Related papers (2025-12-29T20:18:57Z)
- Towards Transparent and Incentive-Compatible Collaboration in Decentralized LLM Multi-Agent Systems: A Blockchain-Driven Approach [21.498244821985562]
We propose a blockchain-based framework that enables transparent agent registration, verifiable task allocation, and dynamic reputation tracking. Our implementation integrates GPT-4 agents with Solidity contracts and demonstrates, through 50-round simulations, strong task success rates, stable utility distribution, and emergent agent specialization.
arXiv Detail & Related papers (2025-09-20T16:00:24Z)
- Reducing Cognitive Overhead in Tool Use via Multi-Small-Agent Reinforcement Learning [1.974921946982281]
We present MSARL, a framework that explicitly decouples reasoning from tool use. In MSARL, a Reasoning Agent decomposes problems and plans tool invocations, while multiple Tool Agents specialize in specific external tools. On mathematical problem solving with code execution, MSARL significantly improves reasoning stability and final-answer accuracy over single-agent baselines.
arXiv Detail & Related papers (2025-08-12T12:10:53Z)
- Code with Me or for Me? How Increasing AI Automation Transforms Developer Workflows [60.04362496037186]
We present the first controlled study of developer interactions with coding agents. We evaluate two leading copilot and agentic coding assistants. Our results show agents can assist developers in ways that surpass copilots.
arXiv Detail & Related papers (2025-07-10T20:12:54Z)
- Beyond Autocomplete: Designing CopilotLens Towards Transparent and Explainable AI Coding Agents [4.960232980231203]
CopilotLens is an interactive framework that reframes code completion from a simple suggestion into a transparent, explainable interaction. CopilotLens operates as an explanation layer that reconstructs the AI agent's "thought process" through a dynamic, two-level interface.
arXiv Detail & Related papers (2025-06-24T23:50:03Z)
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [89.97082652805904]
We propose QLASS (Q-guided Language Agent Stepwise Search) to automatically generate annotations by estimating Q-values. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value. We empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis.
arXiv Detail & Related papers (2025-02-04T18:58:31Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- ProAgent: Building Proactive Cooperative Agents with Large Language Models [89.53040828210945]
ProAgent is a novel framework that harnesses large language models to create proactive agents.
ProAgent can analyze the present state, and infer the intentions of teammates from observations.
ProAgent exhibits a high degree of modularity and interpretability, making it easily integrated into various coordination scenarios.
arXiv Detail & Related papers (2023-08-22T10:36:56Z)
- RACA: Relation-Aware Credit Assignment for Ad-Hoc Cooperation in Multi-Agent Deep Reinforcement Learning [55.55009081609396]
We propose a novel method, called Relation-Aware Credit Assignment (RACA), which achieves zero-shot generalization in ad-hoc cooperation scenarios.
RACA takes advantage of a graph-based relation encoder to encode the topological structure between agents.
Our method outperforms baseline methods on the StarCraftII micromanagement benchmark and ad-hoc cooperation scenarios.
arXiv Detail & Related papers (2022-06-02T03:39:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.