ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
- URL: http://arxiv.org/abs/2503.08533v1
- Date: Tue, 11 Mar 2025 15:24:02 GMT
- Title: ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
- Authors: Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, Shikhar Bharadwaj, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, Shinji Watanabe
- Abstract summary: We introduce an open-source, user-friendly toolkit to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Using automated evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies.
- Score: 57.806797579986075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but the separate web interface each system ships with makes it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo further provides users with the option to get on-the-fly automated evaluation metrics such as (1) latency, (2) ability to understand user input, (3) coherence, diversity, and relevance of the system response, and (4) intelligibility and audio quality of the system output. Using these evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies, yielding valuable insights such as the finding that current E2E systems have poorer audio quality and less diverse responses. An example demo produced using our toolkit is publicly available here: https://huggingface.co/spaces/Siddhant/Voice_Assistant_Demo.
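As a rough illustration of the cascaded ASR -> LLM -> TTS pattern and the on-the-fly metrics the abstract describes, here is a minimal sketch built on Hugging Face `transformers` pipelines. This is not ESPnet-SDS's actual API: the model checkpoints are placeholder assumptions, and corpus-level distinct-2 stands in for whatever diversity metric the toolkit implements.

```python
# Minimal sketch (NOT the ESPnet-SDS API): one cascaded dialogue turn
# with per-stage latency tracking and a distinct-n diversity proxy.
# Model names are illustrative placeholders.
import time
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
llm = pipeline("text-generation", model="gpt2")
tts = pipeline("text-to-speech", model="suno/bark-small")

def distinct_n(text: str, n: int = 2) -> float:
    """Unique n-grams / total n-grams; a common response-diversity proxy."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def dialogue_turn(user_audio):
    """Run one cascaded turn; return response audio and per-stage metrics."""
    metrics = {}

    start = time.perf_counter()
    user_text = asr(user_audio)["text"]          # 1) transcribe user speech
    metrics["asr_latency_s"] = time.perf_counter() - start

    start = time.perf_counter()
    reply = llm(user_text, max_new_tokens=64)[0]["generated_text"]
    metrics["llm_latency_s"] = time.perf_counter() - start
    metrics["distinct_2"] = distinct_n(reply)    # 2) response diversity

    start = time.perf_counter()
    speech = tts(reply)                          # 3) synthesize the reply
    metrics["tts_latency_s"] = time.perf_counter() - start

    return speech, metrics
```

A unified interface like the linked demo would wrap such a turn function behind a single web UI (e.g., Gradio) so that different cascaded and E2E systems can be swapped in and compared under the same metrics.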
Related papers
- OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios [45.78414948567598]
We propose leveraging synthetic data to enhance dialogue models across diverse scenarios.
We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios.
We also explore critical aspects of training dialogue systems using synthetic data.
arXiv Detail & Related papers (2025-01-02T17:58:23Z)
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
- Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-audio detection framework and benchmark.
It provides a comprehensive evaluation for detecting cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- L2 proficiency assessment using self-supervised speech representations [35.70742768910494]
This work extends an initial analysis of a self-supervised speech-representation-based scheme, which requires no speech recognition, to a large-scale proficiency test.
The performance of the self-supervised wav2vec 2.0 system is compared to that of a high-performance hand-crafted assessment system and a BERT-based text system.
Though the wav2vec 2.0-based system is found to be sensitive to the nature of the response, it can be configured to yield performance comparable to systems requiring a speech transcription.
arXiv Detail & Related papers (2022-11-16T11:47:20Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information [21.527784717450885]
Speech Emotion Recognition aims to help machines understand humans' subjective emotions from audio information alone.
We propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module.
arXiv Detail & Related papers (2022-03-29T08:17:28Z)
- An Adversarial Learning based Multi-Step Spoken Language Understanding System through Human-Computer Interaction [70.25183730482915]
We introduce a novel multi-step spoken language understanding system based on adversarial learning.
We demonstrate that the new system can improve parsing performance by at least 2.5% in terms of F1.
arXiv Detail & Related papers (2021-06-06T03:46:53Z)
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis of different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained with fine-grained supervision signals at different component levels often outperforms systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)
- EmpTransfo: A Multi-head Transformer Architecture for Creating Empathetic Dialog Systems [4.41738804598711]
This paper presents EmpTransfo, a multi-head Transformer architecture for creating an empathetic dialog system.
We show that utilizing the history of emotions and other metadata can improve the quality of generated conversations.
arXiv Detail & Related papers (2020-03-05T23:09:24Z)