$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
- URL: http://arxiv.org/abs/2603.04370v1
- Date: Wed, 04 Mar 2026 18:34:47 GMT
- Title: $τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
- Authors: Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, Victor Barres,
- Abstract summary: $τ$-Knowledge is an extension of $τ$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs. We show that $τ$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
- Score: 58.03692489021332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $τ$-Knowledge, an extension of $τ$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $τ$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only $\sim$25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, $τ$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
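For context, $τ$-Bench-style benchmarks report pass^k, the probability that an agent solves the same task in all k independent trials, which is why a pass^1 figure paired with degradation over repeated trials is meaningful. Below is a minimal sketch of the standard unbiased estimator; the $τ$-Knowledge harness itself is not shown in the abstract, so the function name and interface here are illustrative.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass^k for one task: the probability that
    the agent succeeds in all k i.i.d. trials, given c successes
    observed in n trials. Computed as C(c, k) / C(n, k).
    (Estimator as in tau-Bench-style evals; the name is ours.)"""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: a task solved in 4 of 8 trials.
# pass^1 = 0.50, but pass^4 = C(4,4)/C(8,4) = 1/70 ~ 0.014,
# showing how reliability collapses under repeated trials.
for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k(8, 4, k):.3f}")
```

Averaging this estimate over all tasks gives the benchmark-level score; the worked example shows why pass^k can collapse even when pass^1 looks moderate.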
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural-language claims independent of the robustness of their source.
arXiv Detail & Related papers (2026-03-05T18:42:51Z)
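A minimal sketch of the retrieval-free setup the entry above describes: the claim is checked against the model's parametric knowledge alone, with no search or document context. The prompt wording and label set are assumptions, and `llm` stands in for any text-completion callable.

```python
from typing import Callable

# Prompt wording and label set are illustrative, not the paper's.
PROMPT = (
    "Using only your internal knowledge (no external sources), decide "
    "whether the following claim is true.\n"
    "Claim: {claim}\n"
    "Answer with exactly one label: SUPPORTED, REFUTED, or NOT_ENOUGH_INFO."
)

def fact_check_without_retrieval(claim: str, llm: Callable[[str], str]) -> str:
    """Verify a natural-language claim from parametric knowledge alone."""
    answer = llm(PROMPT.format(claim=claim)).strip().upper()
    # Fall back to the abstaining label on malformed model output.
    if answer not in {"SUPPORTED", "REFUTED", "NOT_ENOUGH_INFO"}:
        return "NOT_ENOUGH_INFO"
    return answer

# Usage with a stub model standing in for a real LLM:
print(fact_check_without_retrieval("Water boils at 100 °C at sea level.",
                                   lambda p: "SUPPORTED"))
```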
- FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents [53.03492387564392]
We introduce FS-Researcher, a file-system-based framework that scales deep research beyond the context window via a persistent workspace. A Context Builder agent browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts.
arXiv Detail & Related papers (2026-02-02T03:00:19Z)
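The FS-Researcher entry above describes a persistent, file-system-based workspace: a Context Builder writes notes into a hierarchical knowledge base, and a Report Writer later reads it back one section at a time. A minimal sketch of that division of labor follows; the directory layout and note format are assumptions, not the paper's actual design.

```python
from pathlib import Path

# Persistent store that outlives any single context window (layout is ours).
WORKSPACE = Path("workspace")

def save_note(topic: str, name: str, text: str, source_url: str) -> None:
    """Context Builder: archive a structured note under a topic directory."""
    note_dir = WORKSPACE / "knowledge_base" / topic
    note_dir.mkdir(parents=True, exist_ok=True)
    (note_dir / f"{name}.md").write_text(
        f"# {name}\nSource: {source_url}\n\n{text}\n")

def facts_for_section(topic: str) -> str:
    """Report Writer: load only the notes relevant to one report section,
    keeping the working context small regardless of knowledge-base size."""
    note_dir = WORKSPACE / "knowledge_base" / topic
    return "\n\n".join(p.read_text() for p in sorted(note_dir.glob("*.md")))

save_note("retrieval", "tau-bench", "tau-Bench evaluates tool-using agents.",
          "http://arxiv.org/abs/2603.04370v1")
print(facts_for_section("retrieval"))
```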
- LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering settings. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. It provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
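The LoCoBench-Agent entry mentions 8 specialized tools exposed to the agent. A minimal sketch of a tool registry and dispatch step is shown below, with illustrative tool names rather than the benchmark's actual set.

```python
from pathlib import Path
from typing import Callable, Dict

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, text: str) -> str:
    Path(path).write_text(text)
    return "ok"

def search(pattern: str, path: str) -> str:
    return "\n".join(line for line in Path(path).read_text().splitlines()
                     if pattern in line)

# Illustrative registry; the benchmark's real 8-tool set is only
# summarized above (file operations, search, code analysis).
TOOLS: Dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "write_file": write_file,
    "search": search,
}

def dispatch(tool_name: str, **kwargs) -> str:
    """Route one agent-issued tool call; unknown tools return an error
    string so the agent can recover on the next turn instead of crashing."""
    tool = TOOLS.get(tool_name)
    return tool(**kwargs) if tool else f"error: unknown tool '{tool_name}'"

dispatch("write_file", path="demo.txt", text="hello\nworld\n")
print(dispatch("search", pattern="hello", path="demo.txt"))
```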
- UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI [2.0619484032730813]
UpBench is a benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback.
arXiv Detail & Related papers (2025-11-15T17:39:37Z)
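UpBench's rubric-based evaluation decomposes each job into verifiable acceptance criteria with per-criterion feedback. A minimal sketch of such a rubric as a data structure; the field names and weighting scheme are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    # Field names are ours, not UpBench's actual schema.
    description: str   # one verifiable acceptance criterion from the rubric
    weight: float      # relative importance assigned by the expert freelancer
    passed: bool       # expert judgment on the AI submission
    feedback: str      # per-criterion feedback for the submitter

def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted fraction of acceptance criteria the submission satisfied."""
    total = sum(c.weight for c in criteria)
    return sum(c.weight for c in criteria if c.passed) / total

rubric = [
    Criterion("Deliverable compiles and runs", 2.0, True, "Builds cleanly."),
    Criterion("Matches client's spec", 3.0, False, "Missing the export feature."),
]
print(f"score = {rubric_score(rubric):.2f}")  # 2.0 / 5.0 = 0.40
```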
- DRBench: A Realistic Benchmark for Enterprise Deep Research [81.49694432639406]
DRBench is a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance.
arXiv Detail & Related papers (2025-09-30T18:47:20Z)
- Transparent, Evaluable, and Accessible Data Agents: A Proof-of-Concept Framework [0.0]
This article presents a modular, component-based architecture for developing and evaluating AI agents. The system addresses core challenges in data accessibility by enabling non-technical users to interact with complex data warehouses. A cornerstone of the design is its commitment to transparent decision-making, achieved through a multi-layered reasoning framework.
arXiv Detail & Related papers (2025-09-28T23:54:41Z)
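A minimal sketch of the transparency idea in the entry above: each reasoning layer records what it decided and why, so a non-technical user can audit how a question became a warehouse query. The layer names and trace format are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    # (layer, decision) pairs; the trace format is illustrative.
    steps: list[tuple[str, str]] = field(default_factory=list)

    def log(self, layer: str, decision: str) -> None:
        self.steps.append((layer, decision))

def answer_question(question: str, trace: ReasoningTrace) -> str:
    """Toy multi-layered pipeline in which every layer logs its decision."""
    trace.log("intent", f"interpreted {question!r} as an aggregate query")
    trace.log("planning", "chose table 'sales', aggregation SUM(amount)")
    sql = "SELECT SUM(amount) FROM sales"   # placeholder generated query
    trace.log("generation", f"emitted SQL: {sql}")
    return sql

trace = ReasoningTrace()
answer_question("What were total sales?", trace)
for layer, decision in trace.steps:        # transparent decision record
    print(f"[{layer}] {decision}")
```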
- Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance [58.21767225794469]
Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change. We propose the Adaptive Reflective Interactive Agent (ARIA) to continuously learn updated domain knowledge at test time. ARIA is deployed within TikTok Pay, serving over 150 million monthly active users.
arXiv Detail & Related papers (2025-07-23T02:12:32Z)
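A minimal sketch of the test-time, human-in-the-loop loop the ARIA entry describes: when the agent is unsure, it asks a human, stores the correction, and consults stored corrections first on later queries. The storage, matching, and confidence threshold here are deliberately naive stand-ins for whatever ARIA actually uses.

```python
# Test-time store of human-provided rules (exact-match keys are a
# simplification; a real system would need semantic matching).
knowledge: dict[str, str] = {}

def handle(query: str, agent_answer: str, confidence: float,
           ask_human) -> str:
    """Answer from updated knowledge when available; otherwise escalate
    low-confidence answers to a human and remember the correction."""
    if query in knowledge:                 # updated domain knowledge wins
        return knowledge[query]
    if confidence < 0.7:                   # threshold is an assumption
        correction = ask_human(query)
        knowledge[query] = correction      # learn at test time
        return correction
    return agent_answer

print(handle("refund limit?", "$100", 0.4, lambda q: "$50 since May"))
print(handle("refund limit?", "$100", 0.9, lambda q: "$50 since May"))  # reuses stored rule
```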
- Structuring the Unstructured: A Multi-Agent System for Extracting and Querying Financial KPIs and Guidance [54.25184684077833]
We propose an efficient and scalable method for extracting quantitative insights from unstructured financial documents. Our proposed system consists of two specialized agents: the Extraction Agent and the Text-to-Agent.
arXiv Detail & Related papers (2025-05-25T15:45:46Z)
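A minimal sketch of the two-agent split the entry above names: an extraction step turns unstructured filing text into structured KPI records, and a query step answers questions over those records. The regex and record schema are illustrative only, not the paper's pipeline.

```python
import re

def extraction_agent(text: str) -> list[dict]:
    """Pull (KPI, value, unit) triples out of unstructured financial text.
    The pattern below is a toy; real extraction would use an LLM."""
    pattern = r"(revenue|net income) of \$([\d.]+) (million|billion)"
    return [{"kpi": kpi, "value": float(val), "unit": unit}
            for kpi, val, unit in re.findall(pattern, text, re.IGNORECASE)]

def query_agent(records: list[dict], kpi: str) -> list[dict]:
    """Answer a structured query over the extracted KPI records."""
    return [r for r in records if r["kpi"].lower() == kpi.lower()]

records = extraction_agent(
    "Q2 saw revenue of $4.2 billion and net income of $310 million.")
print(query_agent(records, "revenue"))
# [{'kpi': 'revenue', 'value': 4.2, 'unit': 'billion'}]
```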
- DeepTrust: A Reliable Financial Knowledge Retrieval Framework for Explaining Extreme Pricing Anomalies [0.0]
We introduce DeepTrust, a reliable financial knowledge retrieval framework on Twitter to explain extreme price moves at speed.
Our proposed framework consists of three modules, specialized for anomaly detection, information retrieval and reliability assessment.
The framework is evaluated on two self-annotated financial anomalies, i.e., Twitter and Facebook stock price on 29 and 30 April 2021.
arXiv Detail & Related papers (2022-03-11T06:29:22Z)
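A minimal sketch of the three-module pipeline the DeepTrust entry names: detect the extreme price move, retrieve candidate explanations from posts, then keep only those that pass a reliability check. All thresholds and scoring rules below are placeholders.

```python
def detect_anomaly(returns: list[float], threshold: float = 0.1) -> bool:
    """Anomaly detection (placeholder rule: any |daily return| > 10%)."""
    return any(abs(r) > threshold for r in returns)

def retrieve(posts: list[str], ticker: str) -> list[str]:
    """Information retrieval: collect posts mentioning the asset."""
    return [p for p in posts if ticker.lower() in p.lower()]

def assess_reliability(post: str) -> float:
    """Reliability assessment (placeholder: longer posts score higher;
    the real module would model source credibility)."""
    return min(len(post) / 100, 1.0)

posts = ["FB earnings beat expectations, stock surging", "buy FB now!!!"]
if detect_anomaly([0.02, 0.12]):
    explanations = [p for p in retrieve(posts, "FB")
                    if assess_reliability(p) > 0.3]
    print(explanations)  # keeps only the more reliable explanation
```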