When Is Enough Not Enough? Illusory Completion in Search Agents
- URL: http://arxiv.org/abs/2602.07549v1
- Date: Sat, 07 Feb 2026 13:50:38 GMT
- Title: When Is Enough Not Enough? Illusory Completion in Search Agents
- Authors: Dayoon Ko, Jihyuk Kim, Sohyeon Kim, Haeju Park, Dahyun Lee, Gunhee Kim, Moontae Lee, Kyungjae Lee
- Abstract summary: We study whether search agents reliably reason across all requirements by tracking, verifying, and maintaining multiple conditions. We find that illusory completion frequently occurs, wherein agents believe tasks are complete despite unresolved or violated constraints, leading to underverified answers. We examine whether explicit constraint-state tracking during execution mitigates these failures via LiveLedger, an inference-time tracker.
- Score: 56.98225130959051
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent search agents leverage multi-turn reasoning and search tools to achieve strong performance on multi-hop and long-horizon benchmarks. Yet it remains unclear whether they reliably reason across all requirements by tracking, verifying, and maintaining multiple conditions in these questions. We study this capability under multi-constraint problems, where valid answers must satisfy several constraints simultaneously. We find that illusory completion frequently occurs, wherein agents believe tasks are complete despite unresolved or violated constraints, leading to underverified answers. To diagnose this behavior, we introduce the Epistemic Ledger, an evaluation framework that tracks evidential support and agents' beliefs for each constraint throughout multi-turn reasoning. Our analysis reveals four recurring failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Motivated by these findings, we examine whether explicit constraint-state tracking during execution mitigates these failures via LiveLedger, an inference-time tracker. This simple intervention consistently improves performance, substantially reducing underverified answers (by up to 26.5%) and improving overall accuracy (by up to 11.6%) on multi-constraint problems.
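The abstract describes LiveLedger as an inference-time tracker that records the evidential state of each constraint so an agent cannot declare a task finished while constraints remain unresolved. The paper does not publish its implementation here, so the following is only a minimal illustrative sketch of that idea; the class and method names (`ConstraintLedger`, `record`, `may_finalize`) are hypothetical, not the authors' API.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UNVERIFIED = "unverified"  # no evidence gathered yet
    SUPPORTED = "supported"    # evidence confirms the constraint
    REFUTED = "refuted"        # evidence contradicts the constraint


@dataclass
class ConstraintLedger:
    """Hypothetical sketch of explicit constraint-state tracking.

    Every constraint starts UNVERIFIED; search evidence moves it to
    SUPPORTED or REFUTED. Finalizing an answer is allowed only when
    all constraints are SUPPORTED, which blocks the 'illusory
    completion' failure mode (answering with unverified constraints).
    """
    states: dict[str, State] = field(default_factory=dict)

    def add(self, constraint: str) -> None:
        self.states[constraint] = State.UNVERIFIED

    def record(self, constraint: str, supported: bool) -> None:
        self.states[constraint] = (
            State.SUPPORTED if supported else State.REFUTED
        )

    def unresolved(self) -> list[str]:
        """Constraints that would make an answer underverified."""
        return [c for c, s in self.states.items()
                if s is not State.SUPPORTED]

    def may_finalize(self) -> bool:
        return not self.unresolved()


# Toy usage: two constraints, only one verified -> cannot finalize.
ledger = ConstraintLedger()
ledger.add("released before 2000")
ledger.add("directed by a Korean director")
ledger.record("released before 2000", supported=True)
print(ledger.may_finalize())   # still False
print(ledger.unresolved())     # the constraint lacking evidence
```

The same check could run after every search turn, forcing the agent to keep searching (or revise its candidate) until the ledger is clear, which is the spirit of the intervention the abstract reports.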
Related papers
- BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? [61.247730037229815]
We introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes: resolution scope and knowledge scope. To investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
arXiv Detail & Related papers (2026-03-03T17:52:01Z) - DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents [10.197402632091551]
DeepSearchQA is a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks. This dataset is designed to evaluate an agent's ability to execute complex search plans to generate exhaustive answer lists.
arXiv Detail & Related papers (2026-01-28T19:20:47Z) - Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks [54.31998314008198]
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks. We attribute this limitation to reasoning overconfidence: a tendency to express undue certainty in an incomplete solution set. We propose the cognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths.
arXiv Detail & Related papers (2025-12-01T14:35:06Z) - Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond [88.5807076505261]
Large Reasoning Models (LRMs) have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. A growing concern lies in their tendency to produce excessively long reasoning traces. This inefficiency introduces significant challenges for training, inference, and real-world deployment.
arXiv Detail & Related papers (2025-03-27T15:36:30Z) - Why Do Multi-Agent LLM Systems Fail? [87.90075668488434]
We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. We build the first Multi-Agent System Failure Taxonomy (MAST). We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent).
arXiv Detail & Related papers (2025-03-17T19:04:38Z) - Regression with Multi-Expert Deferral [30.389055604165222]
Learning to defer with multiple experts is a framework where the learner can choose to defer the prediction to several experts.
We present a novel framework of regression with deferral, which involves deferring the prediction to multiple experts.
We introduce new surrogate loss functions for both scenarios and prove that they are supported by $H$-consistency bounds.
arXiv Detail & Related papers (2024-03-28T15:26:38Z) - Large Language Model-Powered Smart Contract Vulnerability Detection: New Perspectives [8.524720028421447]
This paper provides a systematic analysis of the opportunities, challenges, and potential solutions of harnessing Large Language Models (LLMs) such as GPT-4.
Generating more answers with higher randomness largely boosts the likelihood of producing a correct answer but inevitably leads to a higher number of false positives. We propose an adversarial framework dubbed GPTLens that breaks the conventional one-stage detection into two synergistic stages: generation and discrimination.
arXiv Detail & Related papers (2023-10-02T12:37:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.