The Conversational Short-phrase Speaker Diarization (CSSD) Task:
Dataset, Evaluation Metric and Baselines
- URL: http://arxiv.org/abs/2208.08042v1
- Date: Wed, 17 Aug 2022 03:26:23 GMT
- Title: The Conversational Short-phrase Speaker Diarization (CSSD) Task:
Dataset, Evaluation Metric and Baselines
- Authors: Gaofeng Cheng, Yifan Chen, Runyan Yang, Qingxuan Li, Zehui Yang,
Lingxuan Ye, Pengyuan Zhang, Qingqing Zhang, Lei Xie, Yanmin Qian, Kong Aik
Lee, Yonghong Yan
- Abstract summary: This paper describes the Conversational Short-phrase Speaker Diarization (CSSD) task.
It consists of training and testing datasets, an evaluation metric, and baselines.
On the metric side, we design a new conversational DER (CDER) evaluation metric, which calculates speaker diarization (SD) accuracy at the utterance level.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The conversation scenario is one of the most important and most challenging
scenarios for speech processing technologies, because people in conversation
respond to each other in a casual style. Detecting the speech activity of each
person in a conversation is vital to downstream tasks such as natural language
processing and machine translation. The technology that detects "who spoke
when" is known as speaker diarization (SD). The diarization error rate (DER)
has long served as the standard evaluation metric for SD systems. However, DER
fails to give enough weight to short conversational phrases, which are brief
but semantically important. Moreover, the speech community still lacks a
carefully and accurately human-annotated test dataset suitable for evaluating
conversational SD technologies. In this paper, we design and describe the
Conversational Short-phrase Speaker Diarization (CSSD) task, which consists of
training and testing datasets, an evaluation metric, and baselines. On the
dataset side, in addition to the previously open-sourced 180-hour
conversational MagicData-RAMC dataset, we prepare a separate 20-hour
conversational speech test set with carefully and manually verified speaker
timestamp annotations for the CSSD task. On the metric side, we design a new
conversational DER (CDER) evaluation metric, which calculates SD accuracy at
the utterance level. On the baseline side, we adopt a commonly used method,
the Variational Bayes HMM x-vector (VBx) system, as the baseline of the CSSD
task. Our evaluation metric is publicly available at
https://github.com/SpeechClub/CDER_Metric.
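
To make the metric contrast concrete, below is a minimal Python sketch of a time-weighted DER next to an utterance-level error rate in the spirit of CDER. This is not the official implementation (see the repository above): it assumes the hypothesis speaker labels are already mapped to the reference labels, assumes non-overlapping speech, and the 0.5 coverage threshold is an illustrative choice rather than the official matching rule.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def simplified_der(ref, hyp, total_time):
    """Time-weighted error: seconds of reference speech not credited to
    the correct speaker, divided by total scored time. ref/hyp are lists
    of (start, end, speaker) tuples with labels already mapped."""
    error_time = 0.0
    for r_start, r_end, r_spk in ref:
        correct = sum(overlap(r_start, r_end, h_start, h_end)
                      for h_start, h_end, h_spk in hyp if h_spk == r_spk)
        error_time += (r_end - r_start) - correct
    return error_time / total_time

def simplified_cder(ref, hyp, threshold=0.5):
    """Utterance-level error: every utterance counts once, regardless of
    duration, so a missed back-channel costs as much as a missed long
    turn. The coverage threshold is an assumption of this sketch."""
    wrong = 0
    for r_start, r_end, r_spk in ref:
        correct = sum(overlap(r_start, r_end, h_start, h_end)
                      for h_start, h_end, h_spk in hyp if h_spk == r_spk)
        if correct < threshold * (r_end - r_start):
            wrong += 1
    return wrong / len(ref)

# One 10 s turn diarized correctly, one 0.5 s back-channel attributed
# to the wrong speaker:
ref = [(0.0, 10.0, "A"), (10.0, 10.5, "B")]
hyp = [(0.0, 10.5, "A")]
print(f"DER  ~ {simplified_der(ref, hyp, 10.5):.3f}")  # ~0.048
print(f"CDER ~ {simplified_cder(ref, hyp):.3f}")       # 0.500
```

The toy example reproduces the failure mode the abstract describes: losing the short phrase barely moves the time-weighted score, but costs half of the utterances at the utterance level.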
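The VBx baseline can likewise be pictured at a high level: extract x-vectors over short sliding windows, cluster them, then refine the labels with VB-HMM resegmentation. The sketch below covers only the first-pass clustering, using scikit-learn's agglomerative clustering as a simplified stand-in; the synthetic embeddings, window setup, and distance threshold are all illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# In a real pipeline each row would be an x-vector extracted from a
# ~1.5 s sliding window by a pretrained extractor; here, synthetic
# embeddings around three centroids stand in so the sketch runs alone.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(3, 256))
xvectors = np.vstack([c + 0.1 * rng.normal(size=(40, 256)) for c in centroids])

# Cosine-distance AHC with a stopping threshold (tuned on dev data in
# practice) yields initial per-window speaker labels, which the VB-HMM
# stage would then resegment and refine.
ahc = AgglomerativeClustering(
    n_clusters=None,         # let the threshold decide the speaker count
    distance_threshold=0.7,  # illustrative value
    metric="cosine",
    linkage="average",
)
labels = ahc.fit_predict(xvectors)
print(f"{labels.max() + 1} speakers hypothesized over {len(labels)} windows")
```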
Related papers
- SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words (2024-06-19)
We present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation.
We implement three different models and construct a training set following a process similar to SD-Eval's.
The training set contains 1,052.72 hours of speech data and 724.4k utterances.
- DiariST: Streaming Speech Translation with Speaker Diarization (2023-09-14)
We propose DiariST, the first streaming speech translation (ST) and SD solution.
It is built upon a neural-transducer-based streaming ST system and integrates token-level serialized output training and t-vectors.
Our system achieves strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents (2023-05-22)
SpokenWOZ is a large-scale speech-text dataset for spoken task-oriented dialogue (TOD).
Cross-turn slot detection and reasoning slot detection are new challenges introduced by SpokenWOZ.
- End-to-end Spoken Conversational Question Answering: Task, Dataset and Model (2022-04-29)
In spoken question answering, systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering (SCQA) task, aiming to enable systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions from audio recordings, and to explore the feasibility of providing additional cues from different modalities to aid information gathering.
- SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech (2021-11-19)
We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE).
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
- Towards Data Distillation for End-to-end Spoken Conversational Question Answering (2020-10-18)
We propose a new Spoken Conversational Question Answering (SCQA) task.
SCQA aims to enable QA systems to model complex dialogue flows given speech utterances and text corpora.
Our main objective is to build a QA system that handles conversational questions in both spoken and text forms.
- Exploiting Unsupervised Data for Emotion Recognition in Conversations (2020-10-02)
Emotion Recognition in Conversations (ERC) aims to predict the emotional states of speakers in a conversation.
The available supervised data for the ERC task is limited.
We propose a novel approach to leveraging unsupervised conversation data.