Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
- URL: http://arxiv.org/abs/2601.00596v1
- Date: Fri, 02 Jan 2026 07:21:23 GMT
- Title: Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
- Authors: Sumanth Balaji, Piyush Mishra, Aashraya Sachdeva, Suraj Agrawal,
- Abstract summary: We introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support.<n>We evaluate multiple state-of-the-art agent designs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA)<n>We show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o-mini.
- Score: 1.8357468337756873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
Related papers
- Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization [38.863014369090074]
We introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization.<n>Our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performances on GUI navigation and grounding benchmarks.
arXiv Detail & Related papers (2026-02-14T07:44:47Z) - Reason-Plan-ReAct: A Reasoner-Planner Supervising a ReAct Executor for Complex Enterprise Tasks [0.0]
We introduce RP-ReAct, a novel multi-agent approach that decouples strategic planning from low-level execution to achieve superior reliability and efficiency.<n>RP-ReAct consists of a Reasoner Planner Agent (RPA), responsible for planning each sub-step, and one or multiple Proxy-Execution Agent (PEA) that translates sub-steps into concrete tool interactions.<n>We evaluate RP-ReAct, on the challenging, multi-domain ToolQA benchmark using a diverse set of six open-weight reasoning models.
arXiv Detail & Related papers (2025-12-03T08:28:40Z) - Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement [61.35824395228412]
Large language model (LLM) based agents are increasingly used to tackle software engineering tasks.<n>We propose Self-Abstraction from Grounded Experience (SAGE), a framework that enables agents to learn from their own task executions.
arXiv Detail & Related papers (2025-11-08T08:49:38Z) - Multimodal Policy Internalization for Conversational Agents [48.11601444262434]
Multimodal Policy Internalization (MPI) is a new task that internalizes reasoning-intensive multimodal policies into model parameters.<n>We build two datasets spanning synthetic and real-world decision-making and tool-using tasks.<n>TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness.
arXiv Detail & Related papers (2025-10-10T15:28:30Z) - Multi-Agent Tool-Integrated Policy Optimization [67.12841355267678]
Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks.<n>Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses.<n>No existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks.
arXiv Detail & Related papers (2025-10-06T10:44:04Z) - MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers.<n>Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching.<n>We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z) - MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification [5.666070277424383]
MAG-V is a framework to generate a dataset of questions that mimic customer queries.<n>Our synthetic data can improve agent performance on actual customer queries.
arXiv Detail & Related papers (2024-11-28T19:36:11Z) - CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments [90.29937153770835]
We introduce CRMArena, a benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments.<n>We show that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities.<n>Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments.
arXiv Detail & Related papers (2024-11-04T17:30:51Z) - Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization [53.510942601223626]
Large Language Models (LLMs) exhibit robust problem-solving capabilities for diverse tasks.
These task solvers necessitate manually crafted prompts to inform task rules and regulate behaviors.
We propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization.
arXiv Detail & Related papers (2024-02-27T15:09:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.