SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented
Dialogue Agents
- URL: http://arxiv.org/abs/2305.13040v5
- Date: Tue, 12 Mar 2024 08:52:02 GMT
- Authors: Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei
Dai, Hangyu Li, Rui Yan, Fei Huang, Yongbin Li
- Abstract summary: SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Task-oriented dialogue (TOD) models have made significant progress in recent
years. However, previous studies primarily focus on datasets written by
annotators, which has resulted in a gap between academic research and
real-world spoken conversation scenarios. While several small-scale spoken TOD
datasets have been proposed to address robustness issues such as ASR errors, they
ignore the unique challenges of spoken conversation. To address these limitations,
we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD,
containing 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio from
human-to-human spoken conversations. SpokenWOZ further incorporates common
spoken characteristics such as word-by-word processing and reasoning in spoken
language. Based on these characteristics, we present cross-turn slot and
reasoning slot detection as new challenges. We conduct experiments on various
baselines, including text-modal models, newly proposed dual-modal models, and
LLMs, e.g., ChatGPT. The results show that the current models still have
substantial room for improvement in spoken conversation, where the most
advanced dialogue state tracker only achieves 25.65% in joint goal accuracy and
the SOTA end-to-end model only correctly completes the user request in 52.1% of
dialogues. The dataset, code, and leaderboard are available:
https://spokenwoz.github.io/.
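The joint goal accuracy reported in the abstract is conventionally computed as the fraction of turns whose predicted dialogue state matches the gold state exactly, on every slot. The following is a minimal sketch under that standard definition, not the SpokenWOZ evaluation script; the function name, the dict-based state representation, and the slot names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical representation: one dict per turn, mapping slot name -> value.
DialogueState = Dict[str, str]


def joint_goal_accuracy(predicted: List[DialogueState],
                        gold: List[DialogueState]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    # A turn counts as correct only if every slot-value pair matches.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Illustrative slot names and values (assumptions, not taken from the dataset).
gold_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-day": "friday"},
]
predicted_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},  # one wrong slot fails the turn
    {"hotel-area": "north", "hotel-stars": "4", "hotel-day": "friday"},
]

jga = joint_goal_accuracy(predicted_states, gold_states)
print(f"Joint goal accuracy: {jga:.4f}")  # 2 of 3 turns match exactly
```

Because a single wrong slot value fails the whole turn, the metric is strict, which is one reason scores on noisy spoken input can be low.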
Related papers
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language models (LLMs) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- Retrieval Augmented End-to-End Spoken Dialog Models [20.896330994089283]
We apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal.
Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval augmented SLM (ReSLM) that overcomes this weakness.
We evaluated ReSLM on speech MultiWoz task (DSTC-11 challenge), and found that this retrieval augmentation boosts model performance.
arXiv Detail & Related papers (2024-02-02T18:23:09Z)
- Are LLMs Robust for Spoken Dialogues? [10.855403629160921]
Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks.
Most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations.
We have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets.
arXiv Detail & Related papers (2024-01-04T14:36:38Z)
- PRESTO: A Multilingual Dataset for Parsing Realistic Task-Oriented
Dialogs [39.58414649004708]
PRESTO is a dataset of over 550K contextual multilingual conversations between humans and virtual assistants.
It contains challenges that occur in real-world NLU tasks such as disfluencies, code-switching, and revisions.
Our mT5-based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model.
arXiv Detail & Related papers (2023-03-15T21:51:13Z)
- The Conversational Short-phrase Speaker Diarization (CSSD) Task:
Dataset, Evaluation Metric and Baselines [63.86406909879314]
This paper describes the Conversational Short-phrases Speaker Diarization (CSSD) task.
It consists of training and testing datasets, evaluation metric and baselines.
In the metric aspect, we design the new conversational DER (CDER) evaluation metric, which calculates the SD accuracy at the utterance level.
arXiv Detail & Related papers (2022-08-17T03:26:23Z)
- End-to-end Spoken Conversational Question Answering: Task, Dataset and
Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities with systems in information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z)
- TOD-DA: Towards Boosting the Robustness of Task-oriented Dialogue
Modeling on Spoken Conversations [24.245354500835465]
We propose a novel model-agnostic data augmentation paradigm to boost the robustness of task-oriented dialogue modeling on spoken conversations.
Our approach ranked first in both tasks of DSTC10 Track2, a benchmark for task-oriented dialogue modeling on spoken conversations.
arXiv Detail & Related papers (2021-12-23T10:04:25Z)
- "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken
Conversations [87.95711406978157]
This work presents a new benchmark on spoken task-oriented conversations.
We study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling.
Our data set enables speech-based benchmarking of task-oriented dialogue systems.
arXiv Detail & Related papers (2021-09-28T04:51:04Z)
- Interview: A Large-Scale Open-Source Corpus of Media Dialog [11.28504775964698]
We introduce 'Interview': a large-scale (105K conversations) media dialog dataset collected from news interview transcripts.
Compared to existing large-scale proxies for conversational data, language models trained on our dataset exhibit better zero-shot out-of-domain performance.
'Interview' contains speaker role annotations for each turn, facilitating the development of engaging, responsive dialog systems.
arXiv Detail & Related papers (2020-04-07T02:44:50Z)