From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants
- URL: http://arxiv.org/abs/2602.15859v1
- Date: Mon, 26 Jan 2026 07:44:47 GMT
- Title: From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants
- Authors: Krittin Pachtrachai, Petmongkon Pornpichitsuwan, Wachiravit Modecrua, Touchapon Kraisingkorn
- Abstract summary: Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off - particularly in domains that depend heavily on real-time information. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts. Incoming transcripts are first graded using a simplified adaptation of the PIPA framework, focusing on observation alignment and appropriate response behavior, and are filtered to retain only high-quality interactions exhibiting coherent flow and effective human agent responses. Structured knowledge is then extracted from curated transcripts using large language models (LLMs) and deployed as the sole grounding source in a Retrieval-Augmented Generation (RAG) pipeline. Assistant behavior is governed through systematic prompt tuning, progressing from monolithic prompts to lean, modular, and governed designs that ensure consistency, safety, and controllable execution. Evaluation is conducted using a transcript-grounded user simulator, enabling quantitative measurement of call coverage, factual accuracy, and human escalation behavior. Additional red teaming assesses robustness against prompt injection, out-of-scope, and out-of-context attacks. Experiments are conducted in the Real Estate and Specialist Recruitment domains, which are intentionally challenging and currently suboptimal for automation due to their reliance on real-time data. Despite these constraints, the assistant autonomously handles approximately 30 percent of calls, achieves near-perfect factual accuracy and rejection behavior, and demonstrates strong robustness under adversarial testing.
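The pipeline described in the abstract (grade transcripts, filter to high-quality calls, extract structured knowledge, use it as the sole RAG grounding source) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the grading heuristic, the 0.7 threshold, and all function and class names are hypothetical stand-ins for the PIPA-based grader and LLM extraction steps.

```python
# Hypothetical sketch of the transcript-to-assistant pipeline: grade each
# transcript, keep only high-quality ones, extract structured knowledge,
# and pool it as the grounding corpus for a RAG assistant.
# All names and thresholds here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Transcript:
    call_id: str
    turns: list


def grade_transcript(t: Transcript) -> float:
    """Simplified PIPA-style grade; a real grader would prompt an LLM
    judge on observation alignment and response behavior. Here it is
    stubbed as a length-based heuristic purely for illustration."""
    return min(1.0, len(t.turns) / 10)


def extract_knowledge(t: Transcript) -> list:
    """Stub for LLM-based structured knowledge extraction."""
    return [turn for turn in t.turns if turn.strip()]


def build_knowledge_base(transcripts, threshold=0.7):
    """Filter to high-quality calls, then pool extracted facts
    as the sole RAG grounding corpus."""
    kb = []
    for t in transcripts:
        if grade_transcript(t) >= threshold:
            kb.extend(extract_knowledge(t))
    return kb


calls = [Transcript("c1", ["Hi, I want a 2-bed flat."] * 8),
         Transcript("c2", ["Hello?"])]
kb = build_knowledge_base(calls)
print(len(kb))  # → 8: only the high-quality call contributes
```

In this toy run, the short call `c2` falls below the grading threshold and contributes nothing, mirroring the paper's filtering step that retains only coherent, effective interactions.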
Related papers
- From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning [12.024430772980502]
We introduce an agent-centric benchmarking paradigm for evaluating large language models. A teacher agent generates candidate problems, and an orchestrator agent rigorously verifies their validity and guards against adversarial attacks. If a student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants.
arXiv Detail & Related papers (2026-02-27T06:54:32Z) - SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs [28.59157823781425]
SEAL is a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. SEAL achieves state-of-the-art performance, especially in multi-hop reasoning, comparison, and aggregation tasks. The results validate notable gains in both structural accuracy and computational efficiency.
arXiv Detail & Related papers (2025-12-04T14:52:30Z) - A Knowledge Graph and a Tripartite Evaluation Framework Make Retrieval-Augmented Generation Scalable and Transparent [0.0]
This study presents a Retrieval-Augmented Generation (RAG) system that harnesses a knowledge graph and vector search retrieval to deliver context-rich responses. A central innovation of this work is the introduction of RAG Evaluation (RAG-Eval), a novel chain-of-thought tripartite evaluation framework. RAG-Eval reliably detects factual gaps and query mismatches, thereby fostering trust in high-demand, data-centric environments.
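The hybrid retrieval pattern summarized above (combining knowledge-graph lookup with vector search before generation) can be sketched generically. The toy graph, dot-product scoring, and two-dimensional embeddings below are illustrative assumptions, not that paper's implementation.

```python
# Minimal sketch of hybrid RAG retrieval: structured facts from a toy
# knowledge graph are merged with documents ranked by vector similarity.
# Graph contents, embeddings, and scoring are illustrative assumptions.

def vector_search(query_vec, index, k=2):
    """Rank documents by dot-product similarity in a toy embedding space."""
    scored = sorted(index,
                    key=lambda d: -sum(q * x for q, x in zip(query_vec, d["vec"])))
    return [d["text"] for d in scored[:k]]


def graph_lookup(entity, graph):
    """Return (subject, relation, object) facts attached to an entity node."""
    return [f"{entity} {rel} {obj}" for rel, obj in graph.get(entity, [])]


graph = {"Flat A": [("rented_for", "1200 GBP"), ("located_in", "Zone 2")]}
index = [{"text": "Flat A listing", "vec": [1.0, 0.0]},
         {"text": "Office space",   "vec": [0.0, 1.0]}]

# Merge graph facts and the top vector hit into one grounding context.
context = graph_lookup("Flat A", graph) + vector_search([0.9, 0.1], index, k=1)
print(context)
```

The merged `context` would then be passed to the generator as grounding, with the graph supplying precise entity facts and the vector index supplying fuzzier topical matches.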
arXiv Detail & Related papers (2025-09-23T16:29:22Z) - Co-Investigator AI: The Rise of Agentic AI for Smarter, Trustworthy AML Compliance Narratives [2.7295959384567356]
Co-Investigator AI is an agentic framework optimized to produce Suspicious Activity Reports (SARs) significantly faster and with greater accuracy than traditional methods. We demonstrate its ability to streamline SAR drafting, align narratives with regulatory expectations, and enable compliance teams to focus on higher-order analytical work.
arXiv Detail & Related papers (2025-09-10T08:16:04Z) - Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales [0.0]
We present a methodology for cloning a conversational voice AI agent from a corpus of call recordings. Our system listens to customers over the telephone, responds with a synthetic voice, and follows a structured playbook learned from top performing human agents. The cloned agent is evaluated against human agents on a rubric of 22 criteria covering introduction, product communication, sales drive, objection handling, and closing.
arXiv Detail & Related papers (2025-09-05T07:36:12Z) - How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench [58.114899897566964]
In a multi-turn conversational environment, large language models (LLMs) often struggle with consistent reasoning and adherence to domain-specific policies. We propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules. IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively.
arXiv Detail & Related papers (2025-08-28T15:57:33Z) - Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance [58.21767225794469]
Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change. We propose the Adaptive Reflective Interactive Agent (ARIA) to continuously learn updated domain knowledge at test time. ARIA is deployed within TikTok Pay serving over 150 million monthly active users.
arXiv Detail & Related papers (2025-07-23T02:12:32Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - A Controllable Model of Grounded Response Generation [122.7121624884747]
Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process.
We propose a framework that we call controllable grounded response generation (CGRG).
We show that using this framework, a transformer based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.
arXiv Detail & Related papers (2020-05-01T21:22:08Z) - Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.