SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation
- URL: http://arxiv.org/abs/2512.09142v2
- Date: Fri, 12 Dec 2025 00:59:21 GMT
- Title: SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation
- Authors: Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, Yiyang Chen, Srikanth Madikeri, Esaú Villatoro-Tello, Thomas Schaaf, Ricard Marxer, Petr Motlicek
- Abstract summary: SDialog is an MIT-licensed open-source Python toolkit for building and analyzing conversational agents. It unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized Dialog representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.
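The persona-driven multi-agent simulation pattern described in the abstract, in which two persona-conditioned agents take alternating turns to produce a standardized dialog object, can be sketched independently of SDialog itself. Note that all class and function names below (`Turn`, `Dialog`, `simulate`, `stub_backend`) are illustrative assumptions for this sketch, not SDialog's actual API; a real setup would replace the stub backend with a call to an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Turn:
    """One utterance: who spoke and what was said."""
    speaker: str
    text: str

@dataclass
class Dialog:
    """Standardized container for a full conversation."""
    turns: List[Turn] = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        self.turns.append(Turn(speaker, text))

def simulate(persona_a: Dict, persona_b: Dict,
             backend: Callable[[Dict, "Dialog"], str],
             n_turns: int = 4) -> Dialog:
    """Alternate turns between two persona-conditioned agents.

    The backend receives the active persona and the dialog history,
    so each reply can be conditioned on both.
    """
    dialog = Dialog()
    speakers = [persona_a, persona_b]
    for i in range(n_turns):
        persona = speakers[i % 2]
        reply = backend(persona, dialog)
        dialog.add(persona["name"], reply)
    return dialog

def stub_backend(persona: Dict, dialog: Dialog) -> str:
    """Deterministic stand-in for a real LLM call."""
    return f"({persona['role']}) turn {len(dialog.turns) + 1}"

doctor = {"name": "A", "role": "doctor"}
patient = {"name": "B", "role": "patient"}
d = simulate(doctor, patient, stub_backend, n_turns=4)
for t in d.turns:
    print(f"{t.speaker}: {t.text}")
```

Because the dialog is a plain data structure rather than a transcript string, downstream evaluation (linguistic metrics, LLM-as-a-judge) can iterate over turns directly, which mirrors the dialog-centric design the abstract describes.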
Related papers
- Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems [17.54500572999039]
Full-duplex voice interaction is a natural mode of human-computer interaction. This framework synthesises complex dialogue into minimal conversational units. The system operates in a train-free, plug-and-play manner.
arXiv Detail & Related papers (2026-01-28T04:00:37Z) - A Multimodal Conversational Agent for Tabular Data Analysis [0.2211620227346065]
Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. We present Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. The system lets users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations.
arXiv Detail & Related papers (2025-11-23T11:21:04Z) - ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation [0.0]
ChatChecker is a framework for automated evaluation and testing of complex dialogue systems. It uses large language models (LLMs) to simulate diverse user interactions, identify dialogue breakdowns, and evaluate quality.
arXiv Detail & Related papers (2025-07-22T17:40:34Z) - SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis [0.7919810878571298]
SDialog is a modular, realistic Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management.
arXiv Detail & Related papers (2025-06-12T12:07:51Z) - clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations [18.256529559741075]
clem:todd is a framework for systematically evaluating dialogue systems under consistent conditions. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance.
arXiv Detail & Related papers (2025-05-08T17:36:36Z) - DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling [73.08187964426823]
Large language model (LLM)-enabled dialogue systems have become one of the central modes of human-machine interaction. This paper introduces a new research task: Dialogue Element MOdeling. We propose a novel benchmark, DEMO, designed for comprehensive dialogue modeling and assessment.
arXiv Detail & Related papers (2024-12-06T10:01:38Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thought prompting method (Cue-CoT) to provide a more personalized and engaging response.
We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English.
Empirical results demonstrate our proposed Cue-CoT method outperforms standard prompting methods in terms of both helpfulness and acceptability on all datasets.
arXiv Detail & Related papers (2023-05-19T16:27:43Z) - CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation [75.60156479374416]
CGoDial is a new challenging and comprehensive Chinese benchmark for Goal-oriented Dialog evaluation.
It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources.
To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing.
arXiv Detail & Related papers (2022-11-21T16:21:41Z) - SPACE-3: Unified Dialog Model Pre-training for Task-Oriented Dialog Understanding and Generation [123.37377363355363]
SPACE-3 is a novel unified semi-supervised pre-trained conversation model learning from large-scale dialog corpora.
It can be effectively fine-tuned on a wide range of downstream dialog tasks.
Results show that SPACE-3 achieves state-of-the-art performance on eight downstream dialog benchmarks.
arXiv Detail & Related papers (2022-09-14T14:17:57Z) - SPACE-2: Tree-Structured Semi-Supervised Contrastive Pre-training for Task-Oriented Dialog Understanding [68.94808536012371]
We propose a tree-structured pre-trained conversation model, which learns dialog representations from limited labeled dialogs and large-scale unlabeled dialog corpora.
Our method can achieve new state-of-the-art results on the DialoGLUE benchmark consisting of seven datasets and four popular dialog understanding tasks.
arXiv Detail & Related papers (2022-09-14T13:42:50Z) - ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems [107.35174238206525]
ConvLab-2 is an open-source toolkit that enables researchers to build task-oriented dialogue systems with state-of-the-art models.
The analysis tool presents rich statistics and summarizes common mistakes from simulated dialogues.
The interactive tool allows developers to diagnose an assembled dialogue system by interacting with the system and modifying the output of each system component.
arXiv Detail & Related papers (2020-02-12T04:31:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.