Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding
- URL: http://arxiv.org/abs/2011.12536v2
- Date: Thu, 25 Mar 2021 18:57:26 GMT
- Title: Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding
- Authors: Achintya Kr. Sarkar, Zheng-Hua Tan (Senior Member, IEEE)
- Abstract summary: We propose a vocal tract length (VTL) perturbation method for text-dependent speaker verification (TD-SV).
A set of TD-SV systems are trained, one for each VTL factor, and score-level fusion is applied to make a final decision.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this letter, we propose a vocal tract length (VTL) perturbation method for
text-dependent speaker verification (TD-SV), in which a set of TD-SV systems
are trained, one for each VTL factor, and score-level fusion is applied to make
a final decision. Next, we explore the bottleneck (BN) feature extracted by
training deep neural networks with a self-supervised objective, autoregressive
predictive coding (APC), for TD-SV and compare it with the well-studied
speaker-discriminant BN feature. The proposed VTL method is then applied to APC
and speaker-discriminant BN features. In the end, we combine the VTL
perturbation systems trained on MFCC and the two BN features in the score
domain. Experiments are performed on the RedDots challenge 2016 database of
TD-SV using short utterances with Gaussian mixture model-universal background
model and i-vector techniques. Results show the proposed methods significantly
outperform the baselines.
Related papers
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design that consists of a convolutional neural network and a TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification [18.19207291891767]
Key considerations include training targets, activation functions, and loss functions.
We study a range of loss functions when speaker identity is used as the training target.
We experimentally show that GELU significantly reduces TD-SV error rates compared to sigmoid.
arXiv Detail & Related papers (2022-01-17T14:32:51Z)
- Learning a Word-Level Language Model with Sentence-Level Noise Contrastive Estimation for Contextual Sentence Probability Estimation [3.1040192682787415]
Inferring the probability distribution of sentences or word sequences is a key process in natural language processing.
While word-level language models (LMs) have been widely adopted for computing the joint probabilities of word sequences, they have difficulty capturing a context long enough for sentence probability estimation (SPE).
Recent studies introduced training methods using sentence-level noise-contrastive estimation (NCE) with recurrent neural networks (RNNs).
We apply our method to a simple word-level RNN LM to focus on the effect of the sentence-level NCE training rather than on the network architecture.
arXiv Detail & Related papers (2021-03-14T09:17:37Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Data Generation Using Pass-phrase-dependent Deep Auto-encoders for Text-Dependent Speaker Verification [25.318439244029094]
We propose a novel method that trains pass-phrase-specific deep neural network (PP-DNN) based auto-encoders for creating augmented data for text-dependent speaker verification (TD-SV).
Each PP-DNN auto-encoder is trained using the utterances of a particular pass-phrase available in the target enrollment set.
Experiments are conducted on the RedDots challenge 2016 database for TD-SV using short utterances.
arXiv Detail & Related papers (2021-02-03T14:06:29Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- On Bottleneck Features for Text-Dependent Speaker Verification Using X-vectors [20.829997825439886]
We study x-vectors for text-dependent speaker verification (TD-SV).
We investigate the impact of the different bottleneck (BN) features on the performance of x-vectors.
Experiments are conducted on the RedDots 2016 challenge database.
arXiv Detail & Related papers (2020-05-15T07:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.