On Bottleneck Features for Text-Dependent Speaker Verification Using
X-vectors
- URL: http://arxiv.org/abs/2005.07383v2
- Date: Tue, 1 Sep 2020 14:21:11 GMT
- Authors: Achintya Kumar Sarkar and Zheng-Hua Tan
- Abstract summary: We study x-vectors for text-dependent speaker verification (TD-SV).
We investigate the impact of the different bottleneck (BN) features on the performance of x-vectors.
Experiments are conducted on the RedDots 2016 challenge database.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Applying x-vectors for speaker verification has recently attracted great
interest, with the focus being on text-independent speaker verification. In
this paper, we study x-vectors for text-dependent speaker verification (TD-SV),
which remains unexplored. We further investigate the impact of the different
bottleneck (BN) features on the performance of x-vectors, including the
recently-introduced time-contrastive-learning (TCL) BN features and
phone-discriminant BN features. TCL is a weakly supervised learning approach
that constructs training data by uniformly partitioning each utterance into a
predefined number of segments and then assigning each segment a class label
depending on its position in the utterance. We also compare TD-SV performance
for different modeling techniques, including the Gaussian mixture
models-universal background model (GMM-UBM), i-vector, and x-vector.
Experiments are conducted on the RedDots 2016 challenge database. It is found
that the type of features has a marginal impact on the performance of x-vectors
with the TCL BN feature achieving the lowest equal error rate, while the impact
of features is significant for i-vector and GMM-UBM. The fusion of x-vector and
i-vector systems gives a large gain in performance. The GMM-UBM technique shows
its advantage for TD-SV using short utterances.
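The TCL labeling scheme described in the abstract (uniform partitioning of each utterance, with position-based class labels) can be sketched as follows. This is a minimal illustration; the function name and the frame-level feature representation are assumptions, not details taken from the paper.

```python
import numpy as np

def tcl_labels(features: np.ndarray, n_segments: int = 6) -> np.ndarray:
    """Assign each frame of an utterance a class label based on the
    position of its segment, as in time-contrastive learning (TCL).

    features: (n_frames, feat_dim) array of frame-level features.
    n_segments: predefined number of uniform segments per utterance.
    Returns an (n_frames,) array of integer labels in [0, n_segments).
    """
    n_frames = features.shape[0]
    # Uniform partition: frame i falls in segment floor(i * n_segments / n_frames),
    # so frames are labeled purely by their position in the utterance.
    return (np.arange(n_frames) * n_segments) // n_frames
```

A bottleneck network trained to discriminate these position labels then supplies the TCL BN features used as input to the x-vector, i-vector, and GMM-UBM systems.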
Related papers
- New Equivalences Between Interpolation and SVMs: Kernels and Structured
Features [22.231455330003328]
We present a new and flexible analysis framework for proving SVP in an arbitrary reproducing kernel Hilbert space with a flexible class of generative models for the labels.
We show that SVP occurs in many interesting settings not covered by prior work, and we leverage these results to prove novel generalization results for kernel SVM classification.
arXiv Detail & Related papers (2023-05-03T17:52:40Z)
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- Large-Margin Representation Learning for Texture Classification [67.94823375350433]
This paper presents a novel approach combining convolutional layers (CLs) and large-margin metric learning for training supervised models on small datasets for texture classification.
The experimental results on texture and histopathologic image datasets have shown that the proposed approach achieves competitive accuracy with lower computational cost and faster convergence when compared to equivalent CNNs.
arXiv Detail & Related papers (2022-06-17T04:07:45Z)
- On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification [18.19207291891767]
Key considerations include training targets, activation functions, and loss functions.
We study a range of loss functions when speaker identity is used as the training target.
We experimentally show that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid.
arXiv Detail & Related papers (2022-01-17T14:32:51Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding [0.0]
We propose a vocal tract length (VTL) perturbation method for text-dependent speaker verification (TD-SV).
A set of TD-SV systems are trained, one for each VTL factor, and score-level fusion is applied to make a final decision.
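The score-level fusion mentioned above can be sketched as a weighted sum of per-system trial scores. This is a generic sketch, not necessarily the paper's exact recipe: the z-normalization step and equal default weights are common conventions, and the function name is assumed.

```python
import numpy as np

def fuse_scores(score_lists, weights=None) -> np.ndarray:
    """Score-level fusion: combine per-system verification scores for the
    same set of trials into a single decision score per trial.

    score_lists: list of 1-D score arrays, one per subsystem
                 (e.g., one per VTL perturbation factor).
    weights: optional fusion weights; equal weights by default.
    """
    scores = np.vstack([np.asarray(s, dtype=float) for s in score_lists])
    # Z-normalize each system's scores so that systems with different score
    # scales contribute comparably (trained fusion weights could be used instead).
    scores = (scores - scores.mean(axis=1, keepdims=True)) \
        / scores.std(axis=1, keepdims=True)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    return np.asarray(weights) @ scores
```

With equal weights, two systems whose normalized scores are mirror images cancel to zero, which illustrates why fusion only helps when the subsystems carry complementary information.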
arXiv Detail & Related papers (2020-11-25T06:11:06Z)
- Reducing Confusion in Active Learning for Part-Of-Speech Tagging [100.08742107682264]
Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost.
We study the problem of selecting instances which maximally reduce the confusion between particular pairs of output tags.
Our proposed AL strategy outperforms other AL strategies by a significant margin.
arXiv Detail & Related papers (2020-11-02T06:24:58Z)
- Attention improves concentration when learning node embeddings [1.2233362977312945]
Given nodes labelled with search query text, we want to predict links to related queries that share products.
Experiments with a range of deep neural architectures show that simple feedforward networks with an attention mechanism perform best for learning embeddings.
We propose an analytically tractable model of query generation, AttEST, that views both products and the query text as vectors embedded in a latent space.
arXiv Detail & Related papers (2020-06-11T21:21:12Z)
- Multidirectional Associative Optimization of Function-Specific Word Representations [86.87082468226387]
We present a neural framework for learning associations between interrelated groups of words.
Our model induces a joint function-specific word vector space, where vectors of e.g. plausible SVO compositions lie close together.
The model retains information about word group membership even in the joint space, and can thereby effectively be applied to a number of tasks reasoning over the SVO structure.
arXiv Detail & Related papers (2020-05-11T17:07:20Z)
- Probabilistic embeddings for speaker diarization [13.276960253126656]
Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization.
We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix.
These precisions quantify the uncertainty about what the values of the embeddings might have been if they had been extracted from high quality speech segments.
arXiv Detail & Related papers (2020-04-06T14:51:01Z)
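One natural use of a per-segment diagonal precision, as described above, is precision-weighted pooling of embeddings. The sketch below applies the standard product-of-Gaussians posterior mean, which is an assumption on my part, not necessarily the paper's exact recipe; the function name is likewise hypothetical.

```python
import numpy as np

def pool_uncertain_embeddings(embeddings, precisions):
    """Pool per-segment embeddings, weighting each by its diagonal precision.

    embeddings: (n_segments, dim) x-vector-style embeddings.
    precisions: (n_segments, dim) diagonal precisions; larger values mean
                lower uncertainty (as if extracted from clean, long speech).
    Under independent Gaussian observations, the posterior mean is the
    precision-weighted average and precisions add (product of Gaussians).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    pooled_precision = precisions.sum(axis=0)
    pooled_mean = (precisions * embeddings).sum(axis=0) / pooled_precision
    return pooled_mean, pooled_precision
```

Segments with high precision dominate the pooled embedding, so uncertain (short or noisy) segments are automatically down-weighted.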
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.