Multi-task Learning for Speaker Verification and Voice Trigger Detection
- URL: http://arxiv.org/abs/2001.10816v1
- Date: Sun, 26 Jan 2020 21:19:27 GMT
- Title: Multi-task Learning for Speaker Verification and Voice Trigger Detection
- Authors: Siddharth Sigtia, Erik Marchi, Sachin Kajarekar, Devang Naik, John
Bridle
- Abstract summary: We investigate training a single network to perform both tasks jointly.
We present a large-scale empirical study where the model is trained using several thousand hours of labelled training data.
- Results demonstrate that the network is able to encode both phonetic and speaker information in its learnt representations.
- Score: 18.51531434428444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech transcription and speaker recognition are usually treated as
separate tasks even though they are interdependent. In this study, we
investigate training a single network to perform both tasks jointly. We train
the network in a supervised multi-task learning setup, where the speech
transcription branch of the network is trained to minimise a phonetic
connectionist temporal classification (CTC) loss while the speaker recognition
branch of the network is trained to label the input sequence with the correct
label for the speaker. We present a large-scale empirical study where the model
is trained using several thousand hours of labelled training data for each
task. We evaluate the speech transcription branch of the network on a voice
trigger detection task while the speaker recognition branch is evaluated on a
speaker verification task. Results demonstrate that the network is able to
encode both phonetic and speaker information in its learnt
representations while yielding accuracies at least as good as the baseline
models for each task, with the same number of parameters as the independent
models.
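The joint training setup described in the abstract (a shared acoustic encoder with a phonetic CTC branch and a speaker-classification branch) can be illustrated with a minimal sketch. The PyTorch example below is an assumption-laden illustration, not the paper's implementation: the JointPhoneticSpeakerNet class, the LSTM encoder, the layer sizes, the phone and speaker counts, and the dummy batch are all hypothetical, and the two losses are simply summed without any weighting.

```python
import torch
import torch.nn as nn

class JointPhoneticSpeakerNet(nn.Module):
    """Hypothetical shared encoder with two branches: a phonetic (CTC) head
    and a speaker-classification head. Sizes are illustrative only."""

    def __init__(self, n_mels=40, hidden=256, n_phones=54, n_speakers=1000):
        super().__init__()
        # Shared acoustic encoder over log-mel frames (assumed architecture).
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Phonetic branch: per-frame phone posteriors for the CTC loss (+1 for blank).
        self.phone_head = nn.Linear(2 * hidden, n_phones + 1)
        # Speaker branch: pool over time, then classify the training speaker.
        self.speaker_head = nn.Linear(2 * hidden, n_speakers)

    def forward(self, feats):                            # feats: (batch, time, n_mels)
        enc, _ = self.encoder(feats)                     # (batch, time, 2*hidden)
        phone_logits = self.phone_head(enc)              # (batch, time, n_phones + 1)
        spk_logits = self.speaker_head(enc.mean(dim=1))  # mean-pool over time
        return phone_logits, spk_logits


# One joint training step on a dummy batch: the total loss is the sum of the
# phonetic CTC loss and the speaker cross-entropy (unweighted here by assumption).
model = JointPhoneticSpeakerNet()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss()

feats = torch.randn(8, 200, 40)                      # dummy log-mel features
phone_targets = torch.randint(1, 55, (8, 30))        # dummy phone label sequences (no blanks)
input_lens = torch.full((8,), 200, dtype=torch.long)
target_lens = torch.full((8,), 30, dtype=torch.long)
speaker_ids = torch.randint(0, 1000, (8,))           # dummy speaker labels

phone_logits, spk_logits = model(feats)
log_probs = phone_logits.log_softmax(-1).transpose(0, 1)  # (time, batch, classes) for CTCLoss
loss = ctc_loss(log_probs, phone_targets, input_lens, target_lens) + ce_loss(spk_logits, speaker_ids)
loss.backward()
```

In practice the relative weighting of the two losses, the pooling used for the speaker branch, and the encoder architecture are design choices; the abstract states only that the joint model yields accuracies at least as good as the independent baselines with the same number of parameters.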
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- DASB -- Discrete Audio and Speech Benchmark [12.02056212008393]
We release the Discrete Audio and Speech Benchmark (DASB), a leaderboard for benchmarking discrete audio tokens across a range of tasks.
Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks.
However, the performance gap between semantic tokens and standard continuous representations remains substantial.
arXiv Detail & Related papers (2024-06-20T13:23:27Z)
- Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization [3.836171323110284]
We show that a simple audio convolutional recurrent neural network can perform simultaneous horizontal active speaker detection and localization.
We propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach.
arXiv Detail & Related papers (2023-12-21T16:53:04Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Improved Relation Networks for End-to-End Speaker Verification and Identification [0.0]
Speaker identification systems are tasked to identify a speaker amongst a set of enrolled speakers given just a few samples.
We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification.
Inspired by the use of prototypical networks in speaker verification, we train the model to classify samples in the current episode amongst all speakers present in the training set.
arXiv Detail & Related papers (2022-03-31T17:44:04Z)
- Multi-task Learning with Cross Attention for Keyword Spotting [8.103605110339519]
Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase.
There is a mismatch between the training criterion (phoneme recognition) and the target task (KWS).
Recently, multi-task learning has been applied to KWS to exploit both ASR and KWS training data.
arXiv Detail & Related papers (2021-07-15T22:38:16Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
- Untangling in Invariant Speech Recognition [17.996356271398295]
We study how information is untangled within neural networks trained to recognize speech.
We observe that speaker-specific nuisance variations are discarded by the network's hierarchy, whereas task-relevant properties are untangled in later layers.
We find that the deep representations carry out significant temporal untangling by efficiently extracting task-relevant features at each time step of the computation.
arXiv Detail & Related papers (2020-03-03T20:48:43Z)