Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
- URL: http://arxiv.org/abs/2602.14955v1
- Date: Mon, 16 Feb 2026 17:36:05 GMT
- Title: Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
- Authors: Varun Nathan, Shreyas Guha, Ayush Kumar,
- Abstract summary: We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers.<n>Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator and a one-shot evaluator, and (ii) a data methodology that iteratively refines plans via an evaluator->optimizer loop.
- Score: 2.8180871881371456
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator->optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool-understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
Related papers
- PerfGuard: A Performance-Aware Agent for Visual Content Generation [53.591105729011595]
PerfGuard is a performance-aware agent framework for visual content generation.<n>It integrates tool performance boundaries into task planning and scheduling.<n>It has advantages in tool selection accuracy, execution reliability, and alignment with user intent.
arXiv Detail & Related papers (2026-01-30T05:12:19Z) - TPS-Bench: Evaluating AI Agents' Tool Planning \& Scheduling Abilities in Compounding Tasks [23.96822236741708]
Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding.<n>This paper introduces TPS-Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling.
arXiv Detail & Related papers (2025-11-03T12:45:39Z) - Planning Agents on an Ego-Trip: Leveraging Hybrid Ego-Graph Ensembles for Improved Tool Retrieval in Enterprise Task Planning [0.0]
We propose a Knowledge Graph-based tool retrieval framework that captures the semantic relationships between tools and their functional dependencies.<n>Our retrieval algorithm leverages ensembles of 1-hop ego tool graphs to model direct and indirect connections between tools.<n>Results demonstrate that our tool graph-based method achieves 91.85% tool coverage on the micro-average Complete Recall metric.
arXiv Detail & Related papers (2025-08-07T22:41:12Z) - SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass.<n>with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation.<n>We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP)<n>On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z) - ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [64.86209459039313]
ThinkGeo is an agentic benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning.<n>We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs on 486 structured agentic tasks with 1,773 expert-verified reasoning steps.<n>Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z) - DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing [10.712756715779822]
Large Language Models (LLMs) have shown promise in data processing.<n>These frameworks focus on reducing cost when executing user-specified operations.<n>This is problematic for complex tasks and data.<n>We present DocETL, a system that optimize complex document processing pipelines.
arXiv Detail & Related papers (2024-10-16T03:22:35Z) - Data-Efficient Massive Tool Retrieval: A Reinforcement Learning Approach for Query-Tool Alignment with Language Models [28.67532617021655]
Large language models (LLMs) integrated with external tools and APIs have successfully addressed complex tasks by using in-context learning or fine-tuning.
Despite this progress, the vast scale of tool retrieval remains challenging due to stringent input length constraints.
We propose a pre-retrieval strategy from an extensive repository, effectively framing the problem as the massive tool retrieval (MTR) task.
arXiv Detail & Related papers (2024-10-04T07:58:05Z) - Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose a novel modelagnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z) - Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate Large Language Models' ability in tool utilization.
It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving.
A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
arXiv Detail & Related papers (2024-01-30T16:52:56Z) - ProTIP: Progressive Tool Retrieval Improves Planning [14.386337505825228]
We introduce the Progressive Tool retrieval to Improve Planning (ProTIP) framework.
ProTIP implicitly performs TD without the explicit requirement of subtask labels, while simultaneously maintaining subtask-tool atomicity.
On the ToolBench dataset, ProTIP outperforms the ChatGPT task decomposition-based approach by a remarkable margin.
arXiv Detail & Related papers (2023-12-16T05:43:11Z) - ADaPT: As-Needed Decomposition and Planning with Language Models [131.063805299796]
We introduce As-Needed Decomposition and Planning for complex Tasks (ADaPT)
ADaPT explicitly plans and decomposes complex sub-tasks as-needed, when the Large Language Models is unable to execute them.
Our results demonstrate that ADaPT substantially outperforms established strong baselines.
arXiv Detail & Related papers (2023-11-08T17:59:15Z) - Evaluating and Improving Tool-Augmented Computation-Intensive Math
Reasoning [75.74103236299477]
Chain-of-thought prompting(CoT) and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach that can deliberate the reasoning steps with tool interfaces, namely textbfDELI.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
arXiv Detail & Related papers (2023-06-04T17:02:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.