PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
- URL: http://arxiv.org/abs/2602.05354v1
- Date: Thu, 05 Feb 2026 06:24:23 GMT
- Title: PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
- Authors: Shifat E. Arman, Syed Nazmus Sakib, Tapodhir Karmakar Taton, Nafiul Haque, Shahrear Bin Amin,
- Abstract summary: PATHWAYS is a benchmark of 250 multi-step decision tasks. It tests whether web-based agents can discover and correctly use hidden contextual information.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce PATHWAYS, a benchmark of 250 multi-step decision tasks that test whether web-based agents can discover and correctly use hidden contextual information. Across both closed and open models, agents typically navigate to relevant pages but retrieve decisive hidden evidence in only a small fraction of cases. When tasks require overturning misleading surface-level signals, performance drops sharply to near chance accuracy. Agents frequently hallucinate investigative reasoning by claiming to rely on evidence they never accessed. Even when correct context is discovered, agents often fail to integrate it into their final decision. Providing more explicit instructions improves context discovery but often reduces overall accuracy, revealing a tradeoff between procedural compliance and effective judgement. Together, these results show that current web agent architectures lack reliable mechanisms for adaptive investigation, evidence integration, and judgement override.
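To make the abstract's distinction between navigation, evidence discovery, and decision integration concrete, the following is a minimal, hypothetical Python sketch for scoring a single PATHWAYS-style task. The task fields, URL names, and per-task booleans are illustrative assumptions, not the benchmark's released format or evaluation harness.

# Hypothetical per-task scoring that separates navigation, evidence discovery,
# claimed-but-unvisited evidence, and decision correctness, mirroring the
# failure modes described in the abstract. Field names are assumed.
from dataclasses import dataclass, field

@dataclass
class Task:
    relevant_pages: set[str]    # pages an agent should reach
    evidence_url: str           # page holding the decisive hidden context
    correct_decision: str       # gold label that requires that context

@dataclass
class Trajectory:
    visited_urls: list[str] = field(default_factory=list)
    cited_urls: list[str] = field(default_factory=list)  # evidence the agent claims to use
    decision: str = ""

def score(task: Task, traj: Trajectory) -> dict[str, bool]:
    visited = set(traj.visited_urls)
    navigated = bool(visited & task.relevant_pages)        # reached at least one relevant page
    discovered = task.evidence_url in visited              # actually opened the decisive evidence
    hallucinated = task.evidence_url in traj.cited_urls and not discovered
    integrated = discovered and traj.decision == task.correct_decision
    return {
        "navigated": navigated,
        "discovered": discovered,
        "hallucinated_citation": hallucinated,
        "integrated_correctly": integrated,
        "final_decision_correct": traj.decision == task.correct_decision,
    }

# Example: the agent reached a relevant page and cited the evidence, but never opened it.
task = Task(relevant_pages={"/profile/alice"}, evidence_url="/profile/alice/history",
            correct_decision="reject")
traj = Trajectory(visited_urls=["/profile/alice"],
                  cited_urls=["/profile/alice/history"], decision="approve")
print(score(task, traj))

Aggregating booleans of this shape across tasks would surface exactly the gaps the abstract reports: frequent navigation, rare evidence discovery, citations of evidence that was never accessed, and decisions that remain wrong even when discovery succeeds.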
Related papers
- iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics [9.25340189071758]
We present iAgentBench, a dynamic open-domain question answering (ODQA) benchmark for cross-source sensemaking. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks.
arXiv Detail & Related papers (2026-03-04T22:40:08Z)
- To Search or Not to Search: Aligning the Decision Boundary of Deep Search Agents via Causal Intervention [61.82680155643223]
We identify the root cause as a misaligned decision boundary: the threshold determining when accumulated information suffices to answer. Misalignment causes over-search (redundant searching despite sufficient knowledge) and under-search (premature termination yielding incorrect answers). We propose a comprehensive framework with two key components. First, we introduce a causal intervention-based diagnosis that identifies boundary errors. Second, we develop Decision Boundary Alignment for Deep Search agents (DAS). DAS effectively calibrates these boundaries, mitigating both over-search and under-search to achieve substantial gains in accuracy and efficiency. (An illustrative over-/under-search tally appears after this related-papers list.)
arXiv Detail & Related papers (2026-02-03T09:29:06Z)
- The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution [63.61358761489141]
Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering. We propose a novel framework for general agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome. We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks such as memory-induced bias.
arXiv Detail & Related papers (2026-01-21T15:22:21Z)
- It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents [52.81924177620322]
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks.
arXiv Detail & Related papers (2025-12-29T01:09:10Z)
- Learning to Seek Evidence: A Verifiable Reasoning Agent with Causal Faithfulness Analysis [10.749786847079163]
Explanations for AI models in high-stakes domains like medicine often lack verifiability, which can hinder trust. We propose an interactive agent that produces explanations through an auditable sequence of actions. The agent's evidence-seeking policy is optimized using reinforcement learning, resulting in a model that is both efficient and generalizable.
arXiv Detail & Related papers (2025-11-03T10:21:35Z)
- InteractComp: Evaluating Search Agents With Ambiguous Queries [36.05005463045869]
We introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Evaluation of 17 models reveals a striking failure: the best model achieves only 13.73% accuracy, despite reaching 71.50% when given complete context. This stagnation, coupled with the immediate feedback inherent in search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents.
arXiv Detail & Related papers (2025-10-28T17:35:54Z)
- Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z)
- VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline in which each view is aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
- On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains [32.71308102835446]
Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains. We show that RAG is vulnerable to universal poisoning attacks in medical Q&A. We develop a new detection-based defense to ensure the safe use of RAG.
arXiv Detail & Related papers (2024-09-12T02:43:40Z)
- Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents [110.25679611755962]
Current language model-driven agents often lack mechanisms for effective user participation, which is crucial given the vagueness commonly found in user instructions.
We introduce Intention-in-Interaction (IN3), a novel benchmark designed to inspect users' implicit intentions through explicit queries.
We empirically train Mistral-Interact, a powerful model that proactively assesses task vagueness, inquires about user intentions, and refines them into actionable goals.
arXiv Detail & Related papers (2024-02-14T14:36:30Z)
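The decision-boundary entry above ("To Search or Not to Search") distinguishes over-search from under-search. The sketch below is a toy Python illustration of that distinction under an assumed sufficiency oracle; it is not the paper's DAS method or its causal-intervention diagnosis.

# Toy bucketing of a search agent's stop/continue choices. "sufficient" is an
# assumed oracle label meaning the information gathered so far already supports
# a correct answer; "kept_searching" records whether the agent issued another query.
def diagnose_step(sufficient: bool, kept_searching: bool) -> str:
    if sufficient and kept_searching:
        return "over-search"     # redundant searching despite sufficient knowledge
    if not sufficient and not kept_searching:
        return "under-search"    # premature termination, likely an incorrect answer
    return "aligned"

# Tally boundary errors over a logged trajectory of (sufficient, kept_searching) steps.
steps = [(False, True), (True, True), (True, False)]
counts: dict[str, int] = {}
for sufficient, kept_searching in steps:
    label = diagnose_step(sufficient, kept_searching)
    counts[label] = counts.get(label, 0) + 1
print(counts)  # {'aligned': 2, 'over-search': 1}

Counting these labels over logged trajectories gives a simple way to see whether an agent tends to stop too early or search past the point of sufficiency.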