On Bottleneck Features for Text-Dependent Speaker Verification Using
X-vectors
- URL: http://arxiv.org/abs/2005.07383v2
- Date: Tue, 1 Sep 2020 14:21:11 GMT
- Authors: Achintya Kumar Sarkar and Zheng-Hua Tan
- Abstract summary: We study x-vectors for text-dependent speaker verification (TD-SV).
We investigate the impact of the different bottleneck (BN) features on the performance of x-vectors.
Experiments are conducted on the RedDots 2016 challenge database.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Applying x-vectors for speaker verification has recently attracted great
interest, with the focus being on text-independent speaker verification. In
this paper, we study x-vectors for text-dependent speaker verification (TD-SV),
which remains unexplored. We further investigate the impact of the different
bottleneck (BN) features on the performance of x-vectors, including the
recently-introduced time-contrastive-learning (TCL) BN features and
phone-discriminant BN features. TCL is a weakly supervised learning approach
that constructs training data by uniformly partitioning each utterance into a
predefined number of segments and then assigning each segment a class label
depending on its position in the utterance. We also compare TD-SV performance
for different modeling techniques, including the Gaussian mixture
models-universal background model (GMM-UBM), i-vector, and x-vector.
Experiments are conducted on the RedDots 2016 challenge database. It is found
that the type of features has a marginal impact on the performance of x-vectors
with the TCL BN feature achieving the lowest equal error rate, while the impact
of features is significant for i-vector and GMM-UBM. The fusion of x-vector and
i-vector systems gives a large gain in performance. The GMM-UBM technique shows
its advantage for TD-SV using short utterances.
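The TCL labeling scheme described in the abstract (uniform partitioning of each utterance, with position-based class labels) can be sketched as follows. This is a minimal illustration; the function name and the frame-level feature representation are assumptions, not details taken from the paper.

```python
import numpy as np

def tcl_labels(features: np.ndarray, n_segments: int = 6) -> np.ndarray:
    """Assign each frame of an utterance a class label based on the
    position of its segment, as in time-contrastive learning (TCL).

    features: (n_frames, feat_dim) array of frame-level features.
    n_segments: predefined number of uniform segments per utterance.
    Returns an (n_frames,) array of integer labels in [0, n_segments).
    """
    n_frames = features.shape[0]
    # Uniform partition: frame i falls in segment floor(i * n_segments / n_frames),
    # so frames are labeled purely by their position in the utterance.
    return (np.arange(n_frames) * n_segments) // n_frames
```

A bottleneck network trained to discriminate these position labels then supplies the TCL BN features used as input to the x-vector, i-vector, and GMM-UBM systems.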
Related papers
- New Equivalences Between Interpolation and SVMs: Kernels and Structured
Features [22.231455330003328]
We present a new and flexible analysis framework for proving SVP in an arbitrary reproducing kernel Hilbert space with a flexible class of generative models for the labels.
We show that SVP occurs in many interesting settings not covered by prior work, and we leverage these results to prove novel generalization results for kernel SVM classification.
arXiv Detail & Related papers (2023-05-03T17:52:40Z)
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- Large-Margin Representation Learning for Texture Classification [67.94823375350433]
This paper presents a novel approach combining convolutional layers (CLs) and large-margin metric learning for training supervised models on small datasets for texture classification.
The experimental results on texture and histopathologic image datasets have shown that the proposed approach achieves competitive accuracy with lower computational cost and faster convergence when compared to equivalent CNNs.
arXiv Detail & Related papers (2022-06-17T04:07:45Z)
- On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification [18.19207291891767]
Key considerations include training targets, activation functions, and loss functions.
We study a range of loss functions when speaker identity is used as the training target.
We experimentally show that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid.
arXiv Detail & Related papers (2022-01-17T14:32:51Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding [0.0]
We propose a vocal tract length (VTL) perturbation method for text-dependent speaker verification (TD-SV).
A set of TD-SV systems are trained, one for each VTL factor, and score-level fusion is applied to make a final decision.
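The score-level fusion mentioned above can be sketched as a weighted sum of per-system trial scores. This is a generic sketch, not necessarily the paper's exact recipe: the z-normalization step and equal default weights are common conventions, and the function name is assumed.

```python
import numpy as np

def fuse_scores(score_lists, weights=None) -> np.ndarray:
    """Score-level fusion: combine per-system verification scores for the
    same set of trials into a single decision score per trial.

    score_lists: list of 1-D score arrays, one per subsystem
                 (e.g., one per VTL perturbation factor).
    weights: optional fusion weights; equal weights by default.
    """
    scores = np.vstack([np.asarray(s, dtype=float) for s in score_lists])
    # Z-normalize each system's scores so that systems with different score
    # scales contribute comparably (trained fusion weights could be used instead).
    scores = (scores - scores.mean(axis=1, keepdims=True)) \
        / scores.std(axis=1, keepdims=True)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    return np.asarray(weights) @ scores
```

With equal weights, two systems whose normalized scores are mirror images cancel to zero, which illustrates why fusion only helps when the subsystems carry complementary information.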
arXiv Detail & Related papers (2020-11-25T06:11:06Z)
- Reducing Confusion in Active Learning for Part-Of-Speech Tagging [100.08742107682264]
Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost.
We study the problem of selecting instances which maximally reduce the confusion between particular pairs of output tags.
Our proposed AL strategy outperforms other AL strategies by a significant margin.
arXiv Detail & Related papers (2020-11-02T06:24:58Z)
- Attention improves concentration when learning node embeddings [1.2233362977312945]
Given nodes labelled with search query text, we want to predict links to related queries that share products.
Experiments with a range of deep neural architectures show that simple feedforward networks with an attention mechanism perform best for learning embeddings.
We propose an analytically tractable model of query generation, AttEST, that views both products and the query text as vectors embedded in a latent space.
arXiv Detail & Related papers (2020-06-11T21:21:12Z)
- Multidirectional Associative Optimization of Function-Specific Word Representations [86.87082468226387]
We present a neural framework for learning associations between interrelated groups of words.
Our model induces a joint function-specific word vector space, where vectors of e.g. plausible SVO compositions lie close together.
The model retains information about word group membership even in the joint space, and can thereby effectively be applied to a number of tasks reasoning over the SVO structure.
arXiv Detail & Related papers (2020-05-11T17:07:20Z)
- Probabilistic embeddings for speaker diarization [13.276960253126656]
Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization.
We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix.
These precisions quantify the uncertainty about what the values of the embeddings might have been if they had been extracted from high quality speech segments.
arXiv Detail & Related papers (2020-04-06T14:51:01Z)
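One natural use of a per-segment diagonal precision, as described above, is precision-weighted pooling of embeddings. The sketch below applies the standard product-of-Gaussians posterior mean, which is an assumption on my part, not necessarily the paper's exact recipe; the function name is likewise hypothetical.

```python
import numpy as np

def pool_uncertain_embeddings(embeddings, precisions):
    """Pool per-segment embeddings, weighting each by its diagonal precision.

    embeddings: (n_segments, dim) x-vector-style embeddings.
    precisions: (n_segments, dim) diagonal precisions; larger values mean
                lower uncertainty (as if extracted from clean, long speech).
    Under independent Gaussian observations, the posterior mean is the
    precision-weighted average and precisions add (product of Gaussians).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    pooled_precision = precisions.sum(axis=0)
    pooled_mean = (precisions * embeddings).sum(axis=0) / pooled_precision
    return pooled_mean, pooled_precision
```

Segments with high precision dominate the pooled embedding, so uncertain (short or noisy) segments are automatically down-weighted.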
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.