Neural PLDA Modeling for End-to-End Speaker Verification
- URL: http://arxiv.org/abs/2008.04527v1
- Date: Tue, 11 Aug 2020 05:54:54 GMT
- Title: Neural PLDA Modeling for End-to-End Speaker Verification
- Authors: Shreyas Ramoji, Prashant Krishnan, Sriram Ganapathy
- Abstract summary: We propose a neural network approach for backend modeling in speaker verification called the neural PLDA (NPLDA)
In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in an end-to-end fashion.
We show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system.
- Score: 40.842070706362534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While deep learning models have made significant advances in supervised
classification problems, the application of these models for out-of-set
verification tasks like speaker recognition has been limited to deriving
feature embeddings. The state-of-the-art x-vector PLDA based speaker
verification systems use a generative model based on probabilistic linear
discriminant analysis (PLDA) for computing the verification score. Recently, we
had proposed a neural network approach for backend modeling in speaker
verification called the neural PLDA (NPLDA) where the likelihood ratio score of
the generative PLDA model is posed as a discriminative similarity function and
the learnable parameters of the score function are optimized using a
verification cost. In this paper, we extend this work to achieve joint
optimization of the embedding neural network (x-vector network) with the NPLDA
network in an end-to-end (E2E) fashion. This proposed end-to-end model is
optimized directly from the acoustic features with a verification cost function
and during testing, the model directly outputs the likelihood ratio score. With
various experiments using the NIST speaker recognition evaluation (SRE) 2018
and 2019 datasets, we show that the proposed E2E model improves significantly
over the x-vector PLDA baseline speaker verification system.
Related papers
- Adversarial Training of Denoising Diffusion Model Using Dual
Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z) - Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z) - AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach approach for the speaker recognition tasks, named as AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell for multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z) - Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed to achieve two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise, reverberation and b) reduce the system qualitydegradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z) - NPLDA: A Deep Neural PLDA Model for Speaker Verification [40.842070706362534]
We propose a neural network approach for backend modeling in speaker recognition.
The proposed model, termed as neural PLDA (NPLDA), is optimized using the generative PLDA model parameters.
In experiments, the NPLDA model optimized using the proposed loss function improves significantly over the state-of-art PLDA based speaker verification system.
arXiv Detail & Related papers (2020-02-10T05:47:35Z) - Pairwise Discriminative Neural PLDA for Speaker Verification [41.76303371621405]
We propose a Pairwise neural discriminative model for the task of speaker verification.
We construct a differentiable cost function which approximates speaker verification loss.
Experiments are performed on the NIST SRE 2018 development and evaluation datasets.
arXiv Detail & Related papers (2020-01-20T09:52:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.