Streaming Multi-Talker ASR with Token-Level Serialized Output Training
- URL: http://arxiv.org/abs/2202.00842v1
- Date: Wed, 2 Feb 2022 01:27:21 GMT
- Title: Streaming Multi-Talker ASR with Token-Level Serialized Output Training
- Authors: Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang,
Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka
- Abstract summary: t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
- Score: 53.11450530896623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes token-level serialized output training (t-SOT), a novel
framework for streaming multi-talker automatic speech recognition (ASR). Unlike
existing streaming multi-talker ASR models using multiple output layers, the
t-SOT model has only a single output layer that generates recognition tokens
(e.g., words, subwords) of multiple speakers in chronological order based on
their emission times. A special token that indicates the change of "virtual"
output channels is introduced to keep track of the overlapping utterances.
Compared to prior streaming multi-talker ASR models, the t-SOT model has
the advantages of lower inference cost and a simpler model architecture.
Moreover, in our experiments with the LibriSpeechMix and LibriCSS datasets, the
t-SOT-based transformer transducer model achieves state-of-the-art word
error rates, surpassing prior results by a significant margin. For non-overlapping
speech, the t-SOT model is on par with a single-talker ASR model in terms of
both accuracy and computational cost, opening the door for deploying one model
for both single- and multi-talker scenarios.
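To make the serialization concrete, here is a minimal Python sketch of the t-SOT scheme as the abstract describes it: per-speaker tokens with emission times are merged into one chronologically ordered stream, and a special token is inserted whenever consecutive tokens come from different "virtual" output channels. The `<cc>` token name, the two-channel limit, and the `Token` layout are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass

CC = "<cc>"  # special token marking a switch of the "virtual" output channel

@dataclass
class Token:
    text: str     # recognized token (word or subword)
    time: float   # emission time in seconds
    channel: int  # virtual output channel (0 or 1, for up to two overlaps)

def serialize(tokens: list[Token]) -> list[str]:
    """Merge per-channel tokens into one stream sorted by emission time,
    inserting CC whenever the active virtual channel changes."""
    out, prev = [], None
    for tok in sorted(tokens, key=lambda t: t.time):
        if prev is not None and tok.channel != prev:
            out.append(CC)
        out.append(tok.text)
        prev = tok.channel
    return out

def deserialize(stream: list[str]) -> list[list[str]]:
    """Invert serialization: flip the active channel at each CC token,
    recovering the two overlapping token streams."""
    channels, cur = [[], []], 0
    for tok in stream:
        if tok == CC:
            cur = 1 - cur
        else:
            channels[cur].append(tok)
    return channels

# Two overlapping speakers, interleaved by emission time:
tokens = [Token("hello", 0.4, 0), Token("hi", 0.6, 1),
          Token("world", 0.9, 0), Token("there", 1.2, 1)]
print(serialize(tokens))
# ['hello', '<cc>', 'hi', '<cc>', 'world', '<cc>', 'there']
```

Because the output is a single flat token stream, a standard single-output streaming ASR architecture can be trained on it unchanged, which is where the lower inference cost and simpler architecture come from.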
Related papers
- Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems in automatic speech recognition (ASR).
In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and an LLM.
Our approach surpasses traditional AED-based methods on the simulated LibriMix dataset and achieves state-of-the-art performance on the evaluation set of the real-world AMI dataset.
arXiv Detail & Related papers (2024-08-30T17:29:25Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The CSE network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition [36.580955189182404]
This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry.
Our framework, named t-SOT-VA, capitalizes on two independently developed recent technologies: array-geometry-agnostic continuous speech separation (VarArray) and streaming multi-talker ASR based on token-level serialized output training (t-SOT).
Our system achieves state-of-the-art word error rates of 13.7% and 15.5% for the AMI development and evaluation sets, respectively, in the multiple-distant-microphone setting.
arXiv Detail & Related papers (2022-09-12T01:22:04Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- A Comparison of Label-Synchronous and Frame-Synchronous End-to-End Models for Speech Recognition [35.14176176739817]
We compare a representative label-synchronous model (Transformer) with a soft frame-synchronous model based on continuous integrate-and-fire.
Results on three public datasets and a large-scale dataset with 12,000 hours of training data show that the two types of models have complementary advantages consistent with their synchronization modes.
arXiv Detail & Related papers (2020-05-20T15:10:35Z)
- Serialized Output Training for End-to-End Overlapped Speech Recognition [35.894025054676696]
Serialized output training (SOT) is a novel framework for multi-speaker overlapped speech recognition.
SOT uses a model with only one output layer that generates the transcriptions of multiple speakers one after another.
We show that SOT models can transcribe overlapped speech with variable numbers of speakers significantly better than models based on permutation invariant training (PIT).
arXiv Detail & Related papers (2020-03-28T02:37:09Z)
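For contrast with t-SOT's token-level interleaving, the original SOT framework described above serializes at the utterance level: each speaker's complete transcription is emitted in first-in-first-out order of utterance start times, separated by a speaker-change token. Below is a minimal sketch of building such a reference sequence; the `<sc>` token name and the `(start_time, words)` input layout are illustrative assumptions.

```python
SC = "<sc>"  # speaker-change token separating successive speakers' transcriptions

def sot_target(utterances: list[tuple[float, list[str]]]) -> list[str]:
    """Build a SOT reference: concatenate each speaker's full transcription,
    ordered by utterance start time (FIFO), with SC between speakers."""
    ordered = sorted(utterances, key=lambda u: u[0])  # sort by start time
    target: list[str] = []
    for i, (_, words) in enumerate(ordered):
        if i > 0:
            target.append(SC)
        target.extend(words)
    return target

# Two overlapping utterances: the whole first utterance precedes the second,
# unlike t-SOT, where their tokens would be interleaved by emission time.
print(sot_target([(0.0, ["hello", "world"]), (0.5, ["hi", "there"])]))
# ['hello', 'world', '<sc>', 'hi', 'there']
```

The single-output-layer design and the special separator token carry over from SOT to t-SOT; the key difference is that t-SOT orders individual tokens by emission time, which is what makes streaming recognition of overlapping speech possible.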