Exploring the Use of an Unsupervised Autoregressive Model as a Shared
Encoder for Text-Dependent Speaker Verification
- URL: http://arxiv.org/abs/2008.03615v1
- Date: Sat, 8 Aug 2020 22:47:10 GMT
- Title: Exploring the Use of an Unsupervised Autoregressive Model as a Shared
Encoder for Text-Dependent Speaker Verification
- Authors: Vijay Ravi, Ruchao Fan, Amber Afshan, Huanhua Lu and Abeer Alwan
- Abstract summary: We propose a novel way of addressing text-dependent automatic speaker verification (TD-ASV) by using a shared encoder with task-specific decoders.
We show that the proposed approach can leverage large, unlabeled, data-rich domains and learn speech patterns independent of downstream tasks.
- Score: 22.894402178709136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel way of addressing text-dependent automatic
speaker verification (TD-ASV) by using a shared-encoder with task-specific
decoders. An autoregressive predictive coding (APC) encoder is pre-trained in
an unsupervised manner using both out-of-domain (LibriSpeech, VoxCeleb) and
in-domain (DeepMine) unlabeled datasets to learn generic, high-level feature
representation that encapsulates speaker and phonetic content. Two
task-specific decoders were trained using labeled datasets to classify speakers
(SID) and phrases (PID). Speaker embeddings extracted from the SID decoder were
scored using a PLDA backend. SID and PID systems were fused at the score level. There
is a 51.9% relative improvement in minDCF for our system compared to the fully
supervised x-vector baseline on the cross-lingual DeepMine dataset. However,
the i-vector/HMM method outperformed the proposed APC encoder-decoder system. A
fusion of the x-vector/PLDA baseline and the SID/PLDA scores prior to PID
fusion further improved performance by 15%, indicating complementarity of the
proposed approach to the x-vector system. We show that the proposed approach
can leverage large, unlabeled, data-rich domains and learn speech
patterns independent of downstream tasks. Such a system can provide competitive
performance in domain-mismatched scenarios where test data is from data-scarce
domains.
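The abstract reports that the SID and PID systems were fused at the score level. A minimal sketch of one common way to do this, a weighted sum of per-system z-normalized trial scores, is shown below; the normalization scheme and the fusion weight `w` are illustrative assumptions, since the abstract does not specify the exact fusion recipe:

```python
import numpy as np

def z_norm(scores):
    """Normalize a set of trial scores to zero mean, unit variance."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

def fuse_scores(sid_scores, pid_scores, w=0.5):
    """Weighted sum of z-normalized SID and PID trial scores.
    `w` is a hypothetical fusion weight; the paper does not give its value here."""
    return w * z_norm(sid_scores) + (1 - w) * z_norm(pid_scores)

sid = [2.1, -0.5, 1.8, -1.2]   # e.g. PLDA scores from the SID decoder
pid = [0.9, -0.7, 1.1, -0.4]   # e.g. phrase-classification scores from the PID decoder
fused = fuse_scores(sid, pid)
```

Because each system's scores are normalized before summation, neither system's raw score scale dominates the fused decision.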
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- Improved Out-of-Scope Intent Classification with Dual Encoding and Threshold-based Re-Classification [6.975902383951604]
Current methodologies face difficulties with the unpredictable distribution of outliers.
We present the Dual Encoder for Threshold-Based Re-Classification (DETER) framework to address these challenges.
Our model outperforms previous benchmarks, improving F1 score by up to 13% for known intents and 5% for unknown intents.
arXiv Detail & Related papers (2024-05-30T11:46:42Z)
- Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization [41.24045486520547]
We propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNNs).
The proposed E-SHARC framework improves significantly over state-of-the-art diarization systems.
arXiv Detail & Related papers (2024-01-23T15:35:44Z)
- UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation [53.06337011259031]
We introduce UnFuSeD, a novel approach to leverage self-supervised learning for audio classification.
We use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step.
UnFuSeD achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all our baselines.
arXiv Detail & Related papers (2023-03-10T02:43:36Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose a novel holistically-guided decoder to obtain high-resolution, semantically rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)
- Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve word error rates (WERs) similar to i-vectors for single-speaker utterances and significantly lower WERs for utterances containing speaker changes.
arXiv Detail & Related papers (2020-02-14T18:31:31Z)
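The last entry above describes reading relevant i-vectors from a memory block through an attention mechanism. A minimal sketch of such a read is a dot-product attention over stored speaker vectors; the memory size, embedding dimension, and similarity function here are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def read_memory(query, memory):
    """Attention read: score each stored speaker i-vector against the
    query embedding, then return the softmax-weighted sum of memory
    entries (an 'M-vector'-style summary)."""
    weights = softmax(memory @ query)   # (num_speakers,) attention weights
    return weights @ memory             # (dim,) weighted combination

np.random.seed(0)
memory = np.random.randn(10, 64)   # 10 stored speaker i-vectors, dim 64 (illustrative)
query = np.random.randn(64)        # utterance-level query embedding
m_vector = read_memory(query, memory)
```

Because the read is a convex combination of stored vectors, no auxiliary embedding extractor is needed at test time: the query alone selects the relevant speaker information.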
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.