DiariST: Streaming Speech Translation with Speaker Diarization
- URL: http://arxiv.org/abs/2309.08007v2
- Date: Mon, 22 Jan 2024 23:05:55 GMT
- Title: DiariST: Streaming Speech Translation with Speaker Diarization
- Authors: Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian
Xue, Jinyu Li, Takuya Yoshioka
- Abstract summary: We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
- Score: 53.595990270899414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech translation (ST) for conversation recordings involves
several under-explored challenges such as speaker diarization (SD) without
accurate word time stamps and handling of overlapping speech in a streaming
fashion. In this work, we propose DiariST, the first streaming ST and SD
solution. It is built upon a neural transducer-based streaming ST system and
integrates token-level serialized output training and t-vector, which were
originally developed for multi-talker speech recognition. Due to the absence of
evaluation benchmarks in this area, we develop a new evaluation dataset,
DiariST-AliMeeting, by translating the reference Chinese transcriptions of the
AliMeeting corpus into English. We also propose new metrics, called
speaker-agnostic BLEU and speaker-attributed BLEU, to measure the ST quality
while taking SD accuracy into account. Our system achieves a strong ST and SD
capability compared to offline systems based on Whisper, while performing
streaming inference for overlapping speech. To facilitate the research in this
new direction, we release the evaluation data, the offline baseline systems,
and the evaluation code.
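For concreteness, below is a hedged sketch of the two metrics the abstract introduces. The authors' released evaluation code is authoritative; the speaker pooling and best-permutation pairing here are simplifying assumptions, with sacrebleu standing in as the BLEU scorer.

```python
from itertools import permutations

import sacrebleu  # pip install sacrebleu


def speaker_agnostic_bleu(hyp_utts, ref_utts):
    """Pool every utterance into one stream, ignoring speaker labels.

    hyp_utts / ref_utts: lists of (speaker_id, text) in time order.
    """
    hyp = " ".join(text for _, text in hyp_utts)
    ref = " ".join(text for _, text in ref_utts)
    return sacrebleu.corpus_bleu([hyp], [[ref]]).score


def speaker_attributed_bleu(hyp_utts, ref_utts):
    """Score per-speaker text under the best hyp->ref speaker permutation.

    Toy assumption: equal speaker counts on both sides; the released
    scorer defines the general case.
    """
    def pooled(utts, spk):
        return " ".join(t for s, t in utts if s == spk)

    hyp_spks = sorted({s for s, _ in hyp_utts})
    ref_spks = sorted({s for s, _ in ref_utts})
    assert len(hyp_spks) == len(ref_spks), "toy version: equal speaker counts"
    hyps = [pooled(hyp_utts, s) for s in hyp_spks]
    return max(
        sacrebleu.corpus_bleu(hyps, [[pooled(ref_utts, s) for s in perm]]).score
        for perm in permutations(ref_spks)
    )


hyp = [("A", "hello there"), ("B", "fine thanks")]
ref = [("s1", "hello there"), ("s2", "fine thanks and you")]
print(speaker_agnostic_bleu(hyp, ref), speaker_attributed_bleu(hyp, ref))
```

The intuition: speaker-agnostic BLEU ignores diarization entirely, while speaker-attributed BLEU drops whenever translated words land on the wrong speaker, so the gap between the two reflects SD errors.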
Related papers
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces both the WER and the average last-token emission latency by more than 6% relative and 40 ms, respectively.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines [63.86406909879314]
This paper describes the Conversational Short-phrase Speaker Diarization (CSSD) task.
It comprises training and testing datasets, an evaluation metric, and baselines.
On the metric side, we design a new conversational DER (CDER) evaluation metric, which calculates SD accuracy at the utterance level (a toy sketch follows this entry).
arXiv Detail & Related papers (2022-08-17T03:26:23Z)
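CDER's exact matching rules are defined in the CSSD paper and its toolkit; the sketch below only illustrates the utterance-level idea (every utterance counts equally, however short), and the greedy majority-vote label mapping is an assumption here, not the official rule.

```python
from collections import Counter, defaultdict


def dominant_hyp_speaker(utt, hyp_segments):
    """Hypothesis speaker with the largest time overlap with this utterance."""
    start, end, _ = utt
    overlap = Counter()
    for h_start, h_end, spk in hyp_segments:
        ov = min(end, h_end) - max(start, h_start)
        if ov > 0:
            overlap[spk] += ov
    return overlap.most_common(1)[0][0] if overlap else None


def toy_cder(ref_utts, hyp_segments):
    """Utterance-level error rate; both args are (start, end, speaker) triples."""
    dom = [dominant_hyp_speaker(u, hyp_segments) for u in ref_utts]
    # Greedy one-to-one mapping ref speaker -> hyp label by vote count.
    votes = defaultdict(Counter)
    for (_, _, ref_spk), d in zip(ref_utts, dom):
        if d is not None:
            votes[ref_spk][d] += 1
    pairs = sorted(((n, r, h) for r, c in votes.items() for h, n in c.items()),
                   reverse=True)
    mapping, used = {}, set()
    for _, r, h in pairs:
        if r not in mapping and h not in used:
            mapping[r] = h
            used.add(h)
    wrong = sum(1 for (_, _, r), d in zip(ref_utts, dom)
                if d is None or mapping.get(r) != d)
    return wrong / len(ref_utts)


ref = [(0.0, 1.0, "A"), (1.0, 1.2, "B"), (1.2, 2.0, "A")]  # short "B" utterance
hyp = [(0.0, 2.0, "x")]                                    # hyp never splits B out
print(toy_cder(ref, hyp))  # 1/3: the 0.2 s utterance costs a full error
```

Contrast with time-weighted DER, where missing a 0.2-second utterance is nearly free; scoring per utterance is what makes CDER sensitive to short conversational turns.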
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- Large-Scale Streaming End-to-End Speech Translation with Neural Transducers [35.2855796745394]
We introduce a streaming end-to-end speech translation (ST) model to convert audio signals to texts in other languages directly.
Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency.
We extend TT-based ST to multilingual ST, which generates text in multiple languages at the same time (a toy greedy-decoding sketch follows this entry).
arXiv Detail & Related papers (2022-04-11T18:18:53Z)
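The latency argument is architectural: a transducer consumes encoder frames one at a time and decides after each frame whether to emit a token or a blank, so nothing waits for the end of the utterance. The toy below (random weights and made-up dimensions, not the paper's TT model) shows that frame-synchronous loop.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, BLANK = 16, 8, 0                      # vocab size, hidden dim, blank id
W_enc = rng.normal(size=(40, D))            # stand-in "encoder": 40-dim fbank -> D
E_pred = rng.normal(size=(V, D))            # stand-in "prediction net": embedding
W_join = rng.normal(size=(D, V))            # joiner projection


def greedy_decode_stream(frames, max_symbols_per_frame=3):
    """frames: iterable of 40-dim feature vectors arriving one at a time."""
    y, last = [], BLANK
    for f in frames:                        # streaming: one frame at a time
        h_enc = f @ W_enc
        for _ in range(max_symbols_per_frame):
            logits = np.tanh(h_enc + E_pred[last]) @ W_join
            tok = int(logits.argmax())
            if tok == BLANK:                # blank => advance to the next frame
                break
            y.append(tok)
            last = tok
    return y


print(greedy_decode_stream(rng.normal(size=(5, 40))))
```

A cascaded system, by contrast, typically waits for ASR segment boundaries before MT can start, which is where much of its extra latency comes from.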
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model. (A toy t-SOT serialization sketch follows this entry.)
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
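t-SOT, which DiariST builds on, flattens overlapping speakers' token streams into a single stream ordered by token emission time, inserting a special channel-change token at switches. A minimal sketch, assuming two virtual channels and known token times; channel assignment and training details are in the paper.

```python
CC = "<cc>"  # channel-change token


def serialize_tsot(channels):
    """channels: list of lists of (time, token), one list per virtual channel."""
    tagged = sorted(
        (t, ch, tok)
        for ch, stream in enumerate(channels)
        for t, tok in stream
    )
    out, cur = [], None
    for _, ch, tok in tagged:
        if cur is not None and ch != cur:
            out.append(CC)                  # mark the speaker/channel switch
        out.append(tok)
        cur = ch
    return out


# Two overlapping utterances: "hello world" and "good morning".
a = [(0.0, "hello"), (0.4, "world")]
b = [(0.2, "good"), (0.6, "morning")]
print(serialize_tsot([a, b]))
# -> ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```

Because the result is a single token sequence, an ordinary single-output streaming decoder can transcribe (or, in DiariST, translate) overlapping speech without parallel output branches.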
- MeetDot: Videoconferencing with Live Translation Captions [18.60812558978417]
We present MeetDot, a videoconferencing system with live translation captions overlaid on screen.
Our system supports speech and captions in 4 languages and combines automatic speech recognition (ASR) and machine translation (MT) in a cascade.
We implement several features to enhance user experience and reduce cognitive load, such as smoothly scrolling captions and reduced caption flicker (a toy flicker-reduction sketch follows this entry).
arXiv Detail & Related papers (2021-09-20T14:34:14Z)
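The abstract does not spell out MeetDot's flicker policy, so the sketch below shows one standard trick (an assumption, not necessarily what MeetDot implements): only display the caption prefix that has stayed identical across the last k partial hypotheses, so words stop blinking in and out as ASR/MT revise.

```python
from collections import deque


class StableCaption:
    def __init__(self, k=2):
        self.history = deque(maxlen=k + 1)   # current + k previous partials

    def update(self, partial_tokens):
        """Feed the latest partial hypothesis; return the tokens safe to show."""
        self.history.append(list(partial_tokens))
        if len(self.history) < self.history.maxlen:
            return []                        # wait until history fills up
        shown = []
        for toks in zip(*self.history):      # longest prefix shared by all
            if len(set(toks)) == 1:
                shown.append(toks[0])
            else:
                break
        return shown


cap = StableCaption(k=1)
for partial in (["hello"], ["hello", "word"], ["hello", "world"],
                ["hello", "world", "again"]):
    print(cap.update(partial))
# [] -> ['hello'] -> ['hello'] -> ['hello', 'world']
```

The trade-off is a small added display latency (k updates) in exchange for captions that never retract words already shown.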
- UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation [12.63410397982031]
We develop a unified model (UniST) which supports streaming and non-streaming speech translation.
Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST.
arXiv Detail & Related papers (2021-09-15T15:22:10Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of available parallel data.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)