Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
- URL: http://arxiv.org/abs/2508.18210v1
- Date: Mon, 25 Aug 2025 17:10:36 GMT
- Title: Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
- Authors: Rishikesh Devanathan, Varun Nathan, Ayush Kumar
- Abstract summary: Synthetic transcript generation is critical in contact center domains, where privacy and data scarcity limit model training and evaluation. We benchmark four language-agnostic generation strategies, from simple prompting to characteristic-aware multi-stage approaches. Results reveal persistent challenges: no method excels across all traits, with notable deficits in disfluency, sentiment, and behavioral realism.
- Score: 3.2960068101198963
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Synthetic transcript generation is critical in contact center domains, where privacy and data scarcity limit model training and evaluation. Unlike prior synthetic dialogue generation work on open-domain or medical dialogues, contact center conversations are goal-oriented, role-asymmetric, and behaviorally complex, featuring disfluencies, ASR noise, and compliance-driven agent actions. In deployments where transcripts are unavailable, standard pipelines still yield derived call attributes such as Intent Summaries, Topic Flow, and QA Evaluation Forms. We leverage these as supervision signals to guide generation. To assess the quality of such outputs, we introduce a diagnostic framework of 18 linguistically and behaviorally grounded metrics for comparing real and synthetic transcripts. We benchmark four language-agnostic generation strategies, from simple prompting to characteristic-aware multi-stage approaches, alongside reference-free baselines. Results reveal persistent challenges: no method excels across all traits, with notable deficits in disfluency, sentiment, and behavioral realism. Our diagnostic tool exposes these gaps, enabling fine-grained evaluation and stress testing of synthetic dialogue across languages.
Related papers
- From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants [0.0]
Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts.
arXiv Detail & Related papers (2026-01-26T07:44:47Z) - AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering [97.52852990265136]
We introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.
arXiv Detail & Related papers (2026-01-21T07:35:36Z) - A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z) - How Real Are Synthetic Therapy Conversations? Evaluating Fidelity in Prolonged Exposure Dialogues [11.227593857814364]
Growing adoption of synthetic data in healthcare is driven by privacy concerns, limited access to real-world data, and the high cost of annotation. This work explores the use of synthetic Prolonged Exposure (PE) therapeutic conversations for Post-Traumatic Stress Disorder (PTSD). We compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics, including turn-taking patterns and treatment fidelity.
arXiv Detail & Related papers (2025-04-30T16:56:56Z) - Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations [74.83732294523402]
We introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards. We also explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes. Experiments show that dialogue-tuned models outperform traditional methods, with improvements of 9.64% in multi-round reasoning scenarios and 6.18% in accuracy in a noisy environment.
arXiv Detail & Related papers (2025-01-29T18:58:48Z) - Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues [66.69453609603875]
Sociocultural norms serve as guiding principles for personal conduct in social interactions.
We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs).
We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase.
arXiv Detail & Related papers (2024-10-04T00:08:46Z) - Synthetic Patient-Physician Dialogue Generation from Clinical Notes Using LLM [27.33193944412666]
Medical dialogue systems (MDS) enhance patient-physician communication, improve healthcare accessibility, and reduce costs.
However, acquiring suitable data to train these systems poses significant challenges.
Our approach, SynDial, uses a single LLM iteratively with zero-shot prompting and a feedback loop to generate high-quality synthetic dialogues.
arXiv Detail & Related papers (2024-08-12T16:49:22Z) - Hi Model, generating 'nice' instead of 'good' is not as bad as generating 'rice'! Towards Context and Semantic Infused Dialogue Generation Loss Function and Evaluation Metric [46.26506372710482]
We propose a new loss function called Semantic Infused Contextualized diaLogue (SemTextualLogue) loss function.
We also formulate an evaluation metric called Dialuation, incorporating both context and semantic relevance.
arXiv Detail & Related papers (2023-09-11T20:16:38Z) - Dialogue Evaluation with Offline Reinforcement Learning [2.580163308334609]
Task-oriented dialogue systems aim to fulfill user goals through natural language interactions.
They are ideally evaluated with human users, which is infeasible at every iteration of the development phase.
We propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus.
arXiv Detail & Related papers (2022-09-02T08:32:52Z) - DEAM: Dialogue Coherence Evaluation using AMR-based Semantic
Manipulations [46.942369532632604]
We propose a Dialogue Evaluation metric that relies on AMR-based semantic manipulations for incoherent data generation.
Our experiments show that DEAM achieves higher correlations with human judgments compared to baseline methods.
arXiv Detail & Related papers (2022-03-18T03:11:35Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.