Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
- URL: http://arxiv.org/abs/2004.02863v5
- Date: Tue, 11 Aug 2020 02:21:51 GMT
- Title: Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
- Authors: Seong Min Kye, Youngmoon Jung, Hae Beom Lee, Sung Ju Hwang, Hoirin Kim
- Abstract summary: We introduce a meta-learning framework for imbalance length pairs.
We train it with a support set of long utterances and a query set of short utterances of varying lengths.
By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models.
- Score: 65.28795726837386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In practical settings, a speaker recognition system needs to identify a
speaker given a short utterance, while the enrollment utterance may be
relatively long. However, existing speaker recognition models perform poorly
with such short utterances. To solve this problem, we introduce a meta-learning
framework for imbalance length pairs. Specifically, we use a prototypical
network and train it with a support set of long utterances and a query set of
short utterances of varying lengths. Further, since optimizing only for the
classes in the given episode may be insufficient for learning discriminative
embeddings for unseen classes, we additionally train the model to classify
both the support and the query sets against the entire set of classes in the
training set. By combining these two learning schemes, our model outperforms
existing state-of-the-art speaker verification models learned with a standard
supervised learning framework on short utterances (1-2 seconds) on the VoxCeleb
datasets. We also validate our proposed model for unseen speaker
identification, on which it also achieves significant performance gains over
the existing approaches. The code is available at
https://github.com/seongmin-kye/meta-SR.
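To make the two combined learning schemes concrete, below is a minimal PyTorch sketch of one training episode: a prototypical loss built from long-utterance support prototypes and short-utterance queries, plus a global softmax classification of all episode embeddings against every training speaker. The function names, tensor layout, and equal loss weighting are illustrative assumptions, not the implementation in the linked repository.

```python
# Hedged sketch of the two combined objectives (episodic prototypical loss +
# global classification); shapes and hyper-parameters are assumptions.
import torch
import torch.nn.functional as F


def prototypical_loss(support_emb, query_emb, n_way, n_query):
    """Episodic loss for one imbalance-length episode.

    support_emb: (n_way * n_support, dim) embeddings of long utterances,
                 grouped by speaker (speaker 0 first, then speaker 1, ...).
    query_emb:   (n_way * n_query, dim) embeddings of short utterances,
                 grouped the same way.
    """
    dim = support_emb.size(-1)
    # One prototype per speaker: the mean of that speaker's support embeddings.
    prototypes = support_emb.view(n_way, -1, dim).mean(dim=1)        # (n_way, dim)
    # Negative squared Euclidean distance to each prototype acts as the logit.
    logits = -(torch.cdist(query_emb, prototypes) ** 2)              # (n_way*n_query, n_way)
    labels = torch.arange(n_way, device=query_emb.device).repeat_interleave(n_query)
    return F.cross_entropy(logits, labels)


def global_loss(embeddings, speaker_ids, global_classifier):
    """Classify support and query embeddings against *all* training speakers,
    not only the n_way speakers sampled for the current episode."""
    return F.cross_entropy(global_classifier(embeddings), speaker_ids)


# Combined objective for one episode (equal weighting is an assumption):
#   loss = prototypical_loss(s_emb, q_emb, n_way, n_query) \
#        + global_loss(torch.cat([s_emb, q_emb]), episode_speaker_ids, classifier)
```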
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
arXiv Detail & Related papers (2022-04-08T16:27:14Z)
- Improved Relation Networks for End-to-End Speaker Verification and Identification [0.0]
Speaker identification systems are tasked to identify a speaker amongst a set of enrolled speakers given just a few samples.
We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification.
Inspired by the use of prototypical networks in speaker verification, we train the model to classify samples in the current episode amongst all speakers present in the training set.
arXiv Detail & Related papers (2022-03-31T17:44:04Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve a further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis [6.632254395574993]
GANSpeech is a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model.
In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
arXiv Detail & Related papers (2021-06-29T08:15:30Z)
- A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings [32.59716743279858]
We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding (APC), and a CAE to conventional MFCCs (a minimal MFCC extraction sketch follows this list).
arXiv Detail & Related papers (2020-12-14T10:17:25Z)
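As a point of reference for the conventional MFCC baseline mentioned in the last entry above, here is a minimal frame-level MFCC extraction with librosa; the file name, sample rate, and window settings are illustrative assumptions rather than that paper's configuration.

```python
# Illustrative frame-level MFCC extraction; path and settings are assumptions.
import librosa

wav, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical audio file

mfcc = librosa.feature.mfcc(
    y=wav,
    sr=sr,
    n_mfcc=13,        # 13 cepstral coefficients per frame
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms frame shift
)
print(mfcc.shape)      # (13, n_frames): one MFCC vector per short-time frame
```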
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.