Deep Normalization for Speaker Vectors
- URL: http://arxiv.org/abs/2004.04095v2
- Date: Mon, 2 Nov 2020 02:27:10 GMT
- Title: Deep Normalization for Speaker Vectors
- Authors: Yunqi Cai, Lantian Li, Dong Wang and Andrew Abel
- Abstract summary: Deep speaker embedding has demonstrated state-of-the-art performance in speaker recognition tasks.
Deep speaker vectors tend to be non-Gaussian for each individual speaker, and non-homogeneous for distributions of different speakers.
We propose a deep normalization approach based on a novel discriminative normalization flow (DNF) model.
- Score: 13.310988353839237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep speaker embedding has demonstrated state-of-the-art performance in
speaker recognition tasks. However, one potential issue with this approach is
that the speaker vectors derived from deep embedding models tend to be
non-Gaussian for each individual speaker, and non-homogeneous for distributions
of different speakers. These irregular distributions can seriously impact
speaker recognition performance, especially with the popular PLDA scoring
method, which assumes homogeneous Gaussian distribution. In this paper, we
argue that deep speaker vectors require deep normalization, and propose a deep
normalization approach based on a novel discriminative normalization flow (DNF)
model. We demonstrate the effectiveness of the proposed approach with
experiments using the widely used SITW and CNCeleb corpora. In these
experiments, the DNF-based normalization delivered substantial performance
gains and also showed strong generalization capability in out-of-domain tests.
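As a rough illustration of what "non-Gaussian for each individual speaker" can mean in practice, the sketch below computes per-dimension skewness and excess kurtosis for the vectors of one speaker; both are close to zero for Gaussian data. This is not the paper's DNF model or its evaluation protocol. The helper `skew_kurt` and the synthetic "speakers" are hypothetical and stand in for real embeddings, which are not available here.

```python
import numpy as np

def skew_kurt(x):
    """Per-dimension skewness and excess kurtosis of the rows of x.

    x has shape (num_vectors, dim) and holds the vectors of one speaker.
    Both statistics are approximately 0 when the data are Gaussian.
    """
    centered = x - x.mean(axis=0)
    z = centered / centered.std(axis=0)
    skew = (z ** 3).mean(axis=0)
    excess_kurt = (z ** 4).mean(axis=0) - 3.0
    return skew, excess_kurt

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): one Gaussian speaker and one whose
# vectors are drawn from a skewed, heavy-tailed exponential distribution.
speakers = {
    "gaussian": rng.normal(size=(5000, 4)),
    "skewed": rng.exponential(size=(5000, 4)),
}

# Average absolute skewness and excess kurtosis across dimensions.
stats = {}
for name, vecs in speakers.items():
    skew, kurt = skew_kurt(vecs)
    stats[name] = (np.abs(skew).mean(), np.abs(kurt).mean())
    print(f"{name}: |skew|={stats[name][0]:.2f}, |excess kurt|={stats[name][1]:.2f}")
```

Large values for a speaker suggest that the Gaussian assumption behind PLDA scoring is violated for that speaker, and large differences between speakers suggest non-homogeneous distributions.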
Related papers
- GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection [60.78684630040313]
Diffusion models tend to reconstruct normal counterparts of test images with certain noises added.
From the global perspective, the difficulty of reconstructing images with different anomalies is uneven.
We propose a global and local adaptive diffusion model (abbreviated to GLAD) for unsupervised anomaly detection.
arXiv Detail & Related papers (2024-06-11T17:27:23Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Dior-CVAE: Pre-trained Language Models and Diffusion Priors for Variational Dialog Generation [70.2283756542824]
Dior-CVAE is a hierarchical conditional variational autoencoder (CVAE) with diffusion priors to address these challenges.
We employ a diffusion model to increase the complexity of the prior distribution and its compatibility with the distributions produced by a PLM.
Experiments across two commonly used open-domain dialog datasets show that our method can generate more diverse responses without large-scale dialog pre-training.
arXiv Detail & Related papers (2023-05-24T11:06:52Z)
- Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
arXiv Detail & Related papers (2022-04-08T16:27:14Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- Deep Speaker Vector Normalization with Maximum Gaussianality Training [13.310988353839237]
A key problem with deep speaker embedding is that the resulting deep speaker vectors tend to be irregularly distributed.
In previous research, we proposed a deep normalization approach based on a new discriminative normalization flow (DNF) model.
Despite this remarkable success, we empirically found that the latent codes produced by the DNF model are generally neither homogeneous nor Gaussian.
We propose a new Maximum Gaussianality (MG) training approach that directly maximizes the Gaussianality of the latent codes.
arXiv Detail & Related papers (2020-10-30T09:42:06Z)
- Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes [36.63589873242547]
Multi-speaker speech synthesis is a technique for modeling multiple speakers' voices with a single model.
We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs) and latent variable models (DGPLVMs).
arXiv Detail & Related papers (2020-08-07T02:03:27Z)
- DNN Speaker Tracking with Embeddings [0.0]
We propose a novel embedding-based speaker tracking method.
Our design is based on a convolutional neural network that mimics a typical speaker verification PLDA.
To make the baseline system similar to speaker tracking, non-target speakers were added to the recordings.
arXiv Detail & Related papers (2020-07-13T18:40:14Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)