Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models
- URL: http://arxiv.org/abs/2510.03696v1
- Date: Sat, 04 Oct 2025 06:22:47 GMT
- Title: Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models
- Authors: Deepak Babu Piskala, Sharlene Chen, Udita Patel, Parul Kalra, Rafael Castrillo,
- Abstract summary: We propose a comprehensive framework for goal-oriented evaluation of multi-agent systems. Our method segments conversations by user goals and evaluates success using all relevant turns. In an enterprise setting, we apply our framework to evaluate AIDA, a zero-to-one employee conversational agent system.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the quality of multi-turn chatbot interactions remains challenging, as most existing methods assess interactions at the turn level without addressing whether a user's overarching goal was fulfilled. A "goal" here refers to an information need or task, such as asking for policy information or applying for leave. We propose a comprehensive framework for goal-oriented evaluation of multi-agent systems (MAS), introducing the Goal Success Rate (GSR) to measure the percentage of fulfilled goals, and a Root Cause of Failure (RCOF) taxonomy to identify reasons for failure in multi-agent chatbots. Our method segments conversations by user goals and evaluates success using all relevant turns. We present a model-based evaluation system combining teacher LLMs, in which domain experts define goals and set quality standards that serve as guidance for the LLMs. The LLMs use "thinking tokens" to produce interpretable rationales, enabling explainable, data-efficient evaluations. In an enterprise setting, we apply our framework to evaluate AIDA, a zero-to-one employee conversational agent system built from the ground up as a multi-agent system, and observe GSR improvement from 63% to 79% over the six months since its inception. Our framework is generic and offers actionable insights through a detailed defect taxonomy based on analysis of failure points in multi-agent chatbots, diagnosing overall success, identifying key failure modes, and informing system improvements.
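Below is a minimal sketch of how goal-segmented evaluation and the Goal Success Rate might be computed, assuming conversations are already split into per-goal segments and a teacher-model judge is available as a callable. The data shapes and the `teacher_judge`/`toy_judge` names are illustrative assumptions, not the authors' implementation.
```python
# Sketch: Goal Success Rate (GSR) over goal-segmented conversations.
# Assumes a teacher LLM judge is wrapped as a callable; here a toy
# rule-based judge stands in for it.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    text: str

@dataclass
class GoalSegment:
    goal: str              # e.g. "apply for leave"
    turns: List[Turn]      # all turns relevant to this goal

def goal_success_rate(
    segments: List[GoalSegment],
    teacher_judge: Callable[[GoalSegment], bool],
) -> float:
    """Fraction of user goals the teacher model judges as fulfilled."""
    if not segments:
        return 0.0
    fulfilled = sum(1 for seg in segments if teacher_judge(seg))
    return fulfilled / len(segments)

# Toy judge standing in for a teacher LLM that would read the goal,
# the expert-defined quality standard, and all relevant turns.
def toy_judge(seg: GoalSegment) -> bool:
    return any(
        t.role == "assistant" and "leave request submitted" in t.text.lower()
        for t in seg.turns
    )

if __name__ == "__main__":
    conv = [
        GoalSegment(
            goal="apply for leave",
            turns=[Turn("user", "I need to apply for leave next week."),
                   Turn("assistant", "Done: your leave request submitted for review.")],
        ),
        GoalSegment(
            goal="ask about travel policy",
            turns=[Turn("user", "What is the travel policy?"),
                   Turn("assistant", "Sorry, I couldn't find that.")],
        ),
    ]
    print(f"GSR = {goal_success_rate(conv, toy_judge):.0%}")  # -> GSR = 50%
```
In the paper's setting, the judge would instead prompt a teacher LLM with the goal, the expert-defined quality standard, and the relevant turns, and parse its verdict and rationale.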
Related papers
- LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. The framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z) - JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer [19.09571232466437]
We propose Agent-as-Interviewer, a dynamic evaluation paradigm for large language models (LLMs). Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to invoke knowledge tools for wider and deeper knowledge in dynamic multi-turn question generation. We develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent's tool and uses difficulty scoring as strategy guidance.
arXiv Detail & Related papers (2025-09-02T08:52:16Z) - Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation [4.08768677009363]
We propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively.
arXiv Detail & Related papers (2025-08-07T15:39:48Z) - Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z) - Leveraging Large Language Models for Effective and Explainable Multi-Agent Credit Assignment [4.406086834602686]
We show how to reformulate credit assignment as the two pattern-recognition problems of sequence improvement and attribution. Our approach utilizes a centralized reward critic that numerically decomposes the environment reward based on the individual contribution of each agent. Both of our methods far outperform the state of the art on a variety of benchmarks, including Level-Based Foraging, Robotic Warehouse, and our new Spaceworld benchmark, which incorporates collision-related safety constraints.
arXiv Detail & Related papers (2025-02-24T05:56:47Z) - QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [89.97082652805904]
We propose QLASS (Q-guided Language Agent Stepwise Search) to automatically generate annotations by estimating Q-values. With this stepwise guidance, we propose a Q-guided generation strategy that enables language agents to better adapt to long-term value. We empirically demonstrate through qualitative analysis that QLASS can lead to more effective decision making.
arXiv Detail & Related papers (2025-02-04T18:58:31Z) - SedarEval: Automated Evaluation using Self-Adaptive Rubrics [4.97150240417381]
We propose a new evaluation paradigm based on self-adaptive rubrics. SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric. We train a specialized evaluator language model (evaluator LM) to supplant human graders.
arXiv Detail & Related papers (2025-01-26T16:45:09Z) - TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains [19.492393243160244]
Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains. Existing evaluations for vertical domains typically rely on the labor-intensive construction of static, single-turn datasets. We propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains.
arXiv Detail & Related papers (2024-10-15T11:20:42Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies. We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z) - A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - Approximating Online Human Evaluation of Social Chatbots with Prompting [11.657633779338724]
Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs.
We propose an approach to approximate online human evaluation leveraging large language models (LLMs) from the GPT family.
We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline.
arXiv Detail & Related papers (2023-04-11T14:45:01Z)