VarArray Meets t-SOT: Advancing the State of the Art of Streaming
Distant Conversational Speech Recognition
- URL: http://arxiv.org/abs/2209.04974v1
- Date: Mon, 12 Sep 2022 01:22:04 GMT
- Authors: Naoyuki Kanda, Jian Wu, Xiaofei Wang, Zhuo Chen, Jinyu Li, Takuya
Yoshioka
- Abstract summary: This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry.
Our framework, named t-SOT-VA, capitalizes on two independently developed recent technologies: array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level serialized output training (t-SOT).
Our system achieves state-of-the-art word error rates of 13.7% and 15.5% for the AMI development and evaluation sets, respectively, in the multiple-distant-microphone setting.
- Score: 36.580955189182404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel streaming automatic speech recognition (ASR)
framework for multi-talker overlapping speech captured by a distant microphone
array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on
two independently developed recent technologies: array-geometry-agnostic
continuous speech separation, or VarArray, and streaming multi-talker ASR based
on token-level serialized output training (t-SOT). To combine the best of both
technologies, we newly design a t-SOT-based ASR model that generates a
serialized multi-talker transcription based on two separated speech signals
from VarArray. We also propose a pre-training scheme for such an ASR model
where we simulate VarArray's output signals based on monaural single-talker ASR
training data. Conversation transcription experiments using the AMI meeting
corpus show that the system based on the proposed framework significantly
outperforms conventional ones. Our system achieves state-of-the-art word
error rates of 13.7% and 15.5% for the AMI development and evaluation sets,
respectively, in the multiple-distant-microphone setting while retaining the
streaming inference capability.
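The following is a minimal sketch of the inference flow the abstract describes, assuming a VarArray-style separator that always emits two continuous speech streams and a streaming t-SOT ASR model that serializes overlapping speakers with a channel-change token. The class, method, and token names (`separator.separate`, `asr_model.decode_streaming`, `"<cc>"`) are illustrative placeholders, not the authors' released code.

```python
# Minimal sketch of the t-SOT-VA inference flow described in the abstract.
# Component names (separator, asr_model, the "<cc>" token) are assumptions
# made for illustration, not the authors' released implementation.

class TSOTVAPipeline:
    """Chain a VarArray-style separator with a streaming t-SOT ASR model."""

    def __init__(self, separator, asr_model, cc_token="<cc>"):
        self.separator = separator   # array-geometry-agnostic continuous separation
        self.asr_model = asr_model   # streaming multi-talker ASR trained with t-SOT
        self.cc_token = cc_token     # special token marking a change of output channel

    def transcribe_chunk(self, multichannel_audio):
        # 1) Separate the multi-microphone chunk into two overlap-free streams,
        #    regardless of the number or placement of microphones.
        stream_a, stream_b = self.separator.separate(multichannel_audio)

        # 2) The t-SOT ASR model consumes both separated streams and emits one
        #    serialized token sequence; overlapping speakers are interleaved
        #    and delimited by channel-change tokens.
        serialized_tokens = self.asr_model.decode_streaming(stream_a, stream_b)

        # 3) De-serialize: split the token stream at channel-change tokens to
        #    recover one transcript segment per virtual output channel.
        segments, current = [], []
        for token in serialized_tokens:
            if token == self.cc_token:
                segments.append(" ".join(current))
                current = []
            else:
                current.append(token)
        if current:
            segments.append(" ".join(current))
        return segments
```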
Related papers
- Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and an LLM.
Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z)
- Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvements.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
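A minimal sketch of the t-SOT serialization idea appears after this list.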
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- VarArray: Array-Geometry-Agnostic Continuous Speech Separation [26.938313513582642]
Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription.
This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model.
arXiv Detail & Related papers (2021-10-12T05:31:46Z)
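To complement the pipeline sketch above, here is a minimal sketch of the label-serialization idea behind t-SOT for up to two overlapping speakers, assuming word-level end times are available in the training references: words from both speakers are ordered chronologically, and a channel-change token is inserted whenever the active virtual channel flips. The function name and data layout are illustrative assumptions, not the t-SOT authors' implementation.

```python
# Minimal sketch of t-SOT-style label serialization for up to two overlapping
# utterances, assuming word-level end times in the training references.
# This illustrates the idea only; it is not the authors' exact implementation.
CC = "<cc>"


def serialize_two_channel(reference):
    """reference: list of (word, end_time, channel) tuples, channel in {0, 1}."""
    # Order all words of both speakers by their (estimated) end times.
    ordered = sorted(reference, key=lambda w: w[1])

    tokens, prev_channel = [], None
    for word, _, channel in ordered:
        # Emit a channel-change token whenever the active virtual channel flips.
        if prev_channel is not None and channel != prev_channel:
            tokens.append(CC)
        tokens.append(word)
        prev_channel = channel
    return tokens


# Example: two partially overlapping utterances.
ref = [("hello", 0.5, 0), ("how", 0.9, 0), ("hi", 1.0, 1),
       ("are", 1.2, 0), ("there", 1.5, 1), ("you", 1.6, 0)]
print(serialize_two_channel(ref))
# ['hello', 'how', '<cc>', 'hi', '<cc>', 'are', '<cc>', 'there', '<cc>', 'you']
```

Because the serialized sequence becomes a single target transcription, a standard streaming single-talker ASR architecture can be trained on it, which is consistent with the lower inference cost and simpler model architecture attributed to t-SOT above.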