Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding
- URL: http://arxiv.org/abs/2011.12536v2
- Date: Thu, 25 Mar 2021 18:57:26 GMT
- Title: Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding
- Authors: Achintya Kr. Sarkar, Zheng-Hua Tan (Senior Member, IEEE)
- Abstract summary: We propose a vocal tract length (VTL) perturbation method for text-dependent speaker verification (TD-SV).
A set of TD-SV systems are trained, one for each VTL factor, and score-level fusion is applied to make a final decision.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this letter, we propose a vocal tract length (VTL) perturbation method for
text-dependent speaker verification (TD-SV), in which a set of TD-SV systems
are trained, one for each VTL factor, and score-level fusion is applied to make
a final decision. Next, we explore the bottleneck (BN) feature extracted by
training deep neural networks with a self-supervised objective, autoregressive
predictive coding (APC), for TD-SV and compare it with the well-studied
speaker-discriminant BN feature. The proposed VTL method is then applied to APC
and speaker-discriminant BN features. In the end, we combine the VTL
perturbation systems trained on MFCC and the two BN features in the score
domain. Experiments are performed on the RedDots challenge 2016 database of
TD-SV using short utterances with Gaussian mixture model-universal background
model and i-vector techniques. Results show the proposed methods significantly
outperform the baselines.
Related papers
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design that consists of a convolutional neural network and a TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification [18.19207291891767]
Key considerations include training targets, activation functions, and loss functions.
We study a range of loss functions when speaker identity is used as the training target.
We experimentally show that GELU significantly reduces TD-SV error rates compared to sigmoid.
arXiv Detail & Related papers (2022-01-17T14:32:51Z)
- Learning a Word-Level Language Model with Sentence-Level Noise Contrastive Estimation for Contextual Sentence Probability Estimation [3.1040192682787415]
Inferring the probability distribution of sentences or word sequences is a key process in natural language processing.
While word-level language models (LMs) have been widely adopted for computing the joint probabilities of word sequences, they have difficulty capturing a context long enough for sentence probability estimation (SPE).
Recent studies introduced training methods using sentence-level noise-contrastive estimation (NCE) with recurrent neural networks (RNNs).
We apply our method to a simple word-level RNN LM to focus on the effect of the sentence-level NCE training rather than on the network architecture.
arXiv Detail & Related papers (2021-03-14T09:17:37Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Data Generation Using Pass-phrase-dependent Deep Auto-encoders for Text-Dependent Speaker Verification [25.318439244029094]
We propose a novel method that trains pass-phrase-specific deep neural network (PP-DNN) based auto-encoders for creating augmented data for text-dependent speaker verification (TD-SV).
Each PP-DNN auto-encoder is trained using the utterances of a particular pass-phrase available in the target enrollment set.
Experiments are conducted on the RedDots challenge 2016 database for TD-SV using short utterances.
arXiv Detail & Related papers (2021-02-03T14:06:29Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- On Bottleneck Features for Text-Dependent Speaker Verification Using X-vectors [20.829997825439886]
We study x-vectors for text-dependent speaker verification (TD-SV).
We investigate the impact of the different bottleneck (BN) features on the performance of x-vectors.
Experiments are conducted on the RedDots 2016 challenge database.
arXiv Detail & Related papers (2020-05-15T07:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.