Revisiting joint decoding based multi-talker speech recognition with DNN
acoustic model
- URL: http://arxiv.org/abs/2111.00009v1
- Date: Sun, 31 Oct 2021 09:28:04 GMT
- Title: Revisiting joint decoding based multi-talker speech recognition with DNN
acoustic model
- Authors: Martin Kocour, Kateřina Žmolíková, Lucas Ondel, Ján Švec, Marc Delcroix, Tsubasa Ochiai, Lukáš Burget, Jan Černocký
- Abstract summary: We argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly.
We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers.
- Score: 34.061441900912136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In typical multi-talker speech recognition systems, a neural network-based
acoustic model predicts senone state posteriors for each speaker. These are
later used by a single-talker decoder, which is applied to each speaker-specific
output stream separately. In this work, we argue that such a scheme is
sub-optimal and propose a principled solution that decodes all speakers
jointly. We modify the acoustic model to predict joint state posteriors for all
speakers, enabling the network to express uncertainty about the attribution of
parts of the speech signal to the speakers. We employ a joint decoder that can
make use of this uncertainty together with higher-level language information.
For this, we revisit decoding algorithms used in factorial generative models in
early multi-talker speech recognition systems. In contrast with these early
works, we replace the GMM acoustic model with a DNN, which provides greater
modeling power and simplifies part of the inference. We demonstrate the
advantage of joint decoding in proof-of-concept experiments on a mixed-TIDIGITS
dataset.
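To make the joint-decoding idea concrete, below is a minimal sketch of Viterbi decoding over the product state space of two speakers, where a DNN is assumed to output joint state log-posteriors per frame. The function name, the toy left-to-right HMMs, and the omission of state-prior scaling and language-model integration used in a real hybrid system are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: joint Viterbi over the product state space of two speakers,
# scored with DNN joint state log-posteriors (assumed shapes; not the paper's code).
import numpy as np

def joint_viterbi(joint_log_post, log_trans_a, log_trans_b, log_init_a, log_init_b):
    """joint_log_post: (T, Na, Nb) log scores for joint state (s_a, s_b) at frame t.
    log_trans_*:    (N, N) log transition matrices of each speaker's HMM.
    log_init_*:     (N,)   log initial-state probabilities.
    Returns the best per-speaker state paths of the single best joint path."""
    T, Na, Nb = joint_log_post.shape
    delta = np.full((T, Na, Nb), -np.inf)        # best score ending in (i, j) at t
    back = np.zeros((T, Na, Nb, 2), dtype=int)   # backpointers to previous (k, l)

    delta[0] = log_init_a[:, None] + log_init_b[None, :] + joint_log_post[0]
    for t in range(1, T):
        for i in range(Na):
            for j in range(Nb):
                # score of moving from (k, l) at t-1 to (i, j) at t
                cand = (delta[t - 1]
                        + log_trans_a[:, i][:, None]
                        + log_trans_b[:, j][None, :])
                k, l = np.unravel_index(np.argmax(cand), cand.shape)
                delta[t, i, j] = cand[k, l] + joint_log_post[t, i, j]
                back[t, i, j] = (k, l)

    # backtrace from the best final joint state
    i, j = np.unravel_index(np.argmax(delta[-1]), delta[-1].shape)
    path_a, path_b = [i], [j]
    for t in range(T - 1, 0, -1):
        i, j = back[t, i, j]
        path_a.append(i)
        path_b.append(j)
    return path_a[::-1], path_b[::-1]

# Toy usage: 4 frames, 2 HMM states per speaker, random joint scores.
rng = np.random.default_rng(0)
logp = np.log(rng.dirichlet(np.ones(4), size=4).reshape(4, 2, 2))
A = np.log(np.array([[0.7, 0.3], [1e-10, 1.0]]))   # toy left-to-right HMM
pi = np.log(np.array([1.0 - 1e-10, 1e-10]))
print(joint_viterbi(logp, A, A, pi, pi))
```

The brute-force maximization over joint states scales with the product of the per-speaker state counts, which is why practical joint decoders prune or factorize the search space.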
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
Existing DSR systems still suffer from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model that leverages neural codec language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis [13.676243543864347]
We propose an end-to-end method that is able to generate high-quality speech and better similarity for both seen and unseen speakers.
The method consists of three separately trained components: a speaker encoder based on the state-of-the-art TDNN-based ECAPA-TDNN, a FastSpeech2 based synthesizer, and a HiFi-GAN vocoder.
arXiv Detail & Related papers (2022-03-20T07:04:26Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions (a generic permutation-invariant loss is sketched after this list).
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on LibriSpeechMix, a multi-talker dataset derived from LibriSpeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- A review of on-device fully neural end-to-end automatic speech recognition algorithms [20.469868150587075]
We review various end-to-end automatic speech recognition algorithms and their optimization techniques for on-device applications.
Fully neural network-based end-to-end speech recognition algorithms have been proposed.
We extensively discuss their structures, performance, and advantages compared to conventional algorithms.
arXiv Detail & Related papers (2020-12-14T22:18:08Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT (heuristic error assignment training) can achieve better accuracy compared with PIT.
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
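Two entries above lean on permutation-invariant objectives: the variable-number diarization loss and the PIT baseline that HEAT is compared against. The snippet below is a generic, minimal sketch of a permutation-invariant cross-entropy for a fixed number of streams; the function name, tensor shapes, and frame-level pairing are assumptions for illustration, not the exact losses from those papers.

```python
# Generic sketch of a permutation-invariant cross-entropy (PIT-style) loss.
# Shapes and the fixed stream count are illustrative assumptions.
import itertools
import torch
import torch.nn.functional as F

def pit_cross_entropy(logits, targets):
    """logits:  (S, T, C) per-stream frame logits for S output streams.
    targets: (S, T)    per-stream frame labels for S reference streams.
    Returns the cross-entropy under the best output-to-reference assignment."""
    S = logits.shape[0]
    best = None
    for perm in itertools.permutations(range(S)):
        # pair output stream s with reference stream perm[s]
        loss = sum(
            F.cross_entropy(logits[s], targets[perm[s]]) for s in range(S)
        ) / S
        best = loss if best is None else torch.minimum(best, loss)
    return best

# Toy usage: 2 streams, 5 frames, 3 classes.
logits = torch.randn(2, 5, 3)
targets = torch.randint(0, 3, (2, 5))
print(pit_cross_entropy(logits, targets))
```

Searching over all stream permutations grows factorially with the number of streams, which is one motivation for heuristic assignment schemes such as HEAT.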