Mixture Encoder Supporting Continuous Speech Separation for Meeting
Recognition
- URL: http://arxiv.org/abs/2309.08454v1
- Date: Fri, 15 Sep 2023 14:57:28 GMT
- Title: Mixture Encoder Supporting Continuous Speech Separation for Meeting
Recognition
- Authors: Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker,
Ralf Schlüter, Reinhold Haeb-Umbach
- Abstract summary: We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
- Score: 15.610658840718607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many real-life applications of automatic speech recognition (ASR) require
processing of overlapped speech. A common method involves first separating the
speech into overlap-free streams and then performing ASR on the resulting
signals. Recently, the inclusion of a mixture encoder in the ASR model has been
proposed. This mixture encoder leverages the original overlapped speech to
mitigate the effect of artifacts introduced by the speech separation.
Previously, however, the method only addressed two-speaker scenarios. In this
work, we extend this approach to more natural meeting contexts featuring an
arbitrary number of speakers and dynamic overlaps. We evaluate the performance
using different speech separators, including the powerful TF-GridNet model. Our
experiments show state-of-the-art performance on the LibriCSS dataset and
highlight the advantages of the mixture encoder. Furthermore, they demonstrate
the strong separation performance of TF-GridNet, which largely closes the gap
between previous methods and oracle separation.
Related papers
- TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) network that devotes its main training parameters to multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The advantage of video input is consistently demonstrated in the mask-based MVDR speech separation and the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech mixtures constructed by simulation or replay of the Oxford LRS2 dataset.
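For context, the mask-based MVDR beamformer named above usually takes the standard Souden-style form shown below; this is a textbook expression, not taken from the cited paper, and the masks enter only through the covariance estimates.

```latex
% Mask-based MVDR beamformer per frequency bin f (standard formulation,
% not specific to the cited paper):
\mathbf{w}(f) =
  \frac{\boldsymbol{\Phi}_{NN}^{-1}(f)\,\boldsymbol{\Phi}_{SS}(f)}
       {\operatorname{tr}\!\left(\boldsymbol{\Phi}_{NN}^{-1}(f)\,
        \boldsymbol{\Phi}_{SS}(f)\right)}\,\mathbf{u}
% \Phi_{SS}(f), \Phi_{NN}(f): speech and noise spatial covariance matrices,
% estimated with time-frequency masks; \mathbf{u}: one-hot vector selecting
% the reference microphone.
```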
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator [42.8787280791491]
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization.
We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one.
We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
arXiv Detail & Related papers (2023-05-25T17:18:37Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
We propose the Intra-SE-Conformer and Inter-Transformer (ISCIT) architecture for speech separation.
The new SE-Conformer network can model audio sequences across multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence has been predicted, each target speech signal can be re-synthesized by feeding the symbols into the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Continuous Speech Separation with Ad Hoc Microphone Arrays [35.87274524040486]
Speech separation has been shown effective for multi-talker speech recognition.
In this paper, we extend this approach to continuous speech separation.
Two methods are proposed to mitigate a speech duplication problem during single-talker segments.
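To make "continuous speech separation" concrete: long recordings are typically processed in overlapping windows, each window is separated into a fixed number of streams, and neighbouring windows are stitched together after resolving the arbitrary stream order at each boundary. The sketch below illustrates that stitching step under assumed shapes; it is a simplified illustration, not code from the cited paper.

```python
import itertools
import numpy as np

def stitch_css_windows(windows: list, overlap: int) -> np.ndarray:
    """Hypothetical sketch of CSS stream stitching: each chunk was
    separated into n streams, but the stream order is arbitrary across
    chunk boundaries, so we pick the permutation of the new chunk whose
    overlapping samples best correlate with the previous output."""
    out = windows[0].copy()          # (n_streams, win_len)
    n = out.shape[0]
    for win in windows[1:]:
        tail, head = out[:, -overlap:], win[:, :overlap]
        sim = tail @ head.T          # (n, n) correlations on the shared region
        # exhaustive search over stream permutations (fine for small n)
        best = max(itertools.permutations(range(n)),
                   key=lambda p: sum(sim[i, p[i]] for i in range(n)))
        # reorder the new chunk and append only its non-overlapping part
        out = np.concatenate([out, win[list(best), overlap:]], axis=1)
    return out

# Toy usage: three 2-stream chunks of 400 samples with a 100-sample overlap.
chunks = [np.random.randn(2, 400) for _ in range(3)]
stitched = stitch_css_windows(chunks, overlap=100)
print(stitched.shape)  # (2, 1000)
```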
arXiv Detail & Related papers (2021-03-03T13:01:08Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system achieves absolute word error rate (WER) reductions of up to 6.81% (26.83% relative) and 22.22% (56.87% relative) over the baseline audio-only ASR system on overlapped speech constructed by simulation or replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)