Post-Training Embedding Alignment for Decoupling Enrollment and Runtime
Speaker Recognition Models
- URL: http://arxiv.org/abs/2401.12440v1
- Date: Tue, 23 Jan 2024 02:19:31 GMT
- Title: Post-Training Embedding Alignment for Decoupling Enrollment and Runtime
Speaker Recognition Models
- Authors: Chenyang Gao, Brecht Desplanques, Chelsea J.-T. Ju, Aman Chadha,
Andreas Stolcke
- Abstract summary: We propose a lightweight neural network to map the embeddings from two independent models to a shared speaker embedding space.
Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities.
- Score: 18.50444234955465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated speaker identification (SID) is a crucial step for the
personalization of a wide range of speech-enabled services. Typical SID systems
use a symmetric enrollment-verification framework with a single model to derive
embeddings both offline for voice profiles extracted from enrollment
utterances, and online from runtime utterances. Due to the distinct
circumstances of enrollment and runtime, such as different computation and
latency constraints, several applications would benefit from an asymmetric
enrollment-verification framework that uses different models for enrollment and
runtime embedding generation. To support this asymmetric SID where each of the
two models can be updated independently, we propose using a lightweight neural
network to map the embeddings from the two independent models to a shared
speaker embedding space. Our results show that this approach significantly
outperforms cosine scoring in a shared speaker logit space for models that were
trained with a contrastive loss on large datasets with many speaker identities.
This proposed Neural Embedding Speaker Space Alignment (NESSA) combined with an
asymmetric update of only one of the models delivers at least 60% of the
performance gain achieved by updating both models in the standard symmetric SID
approach.
Related papers
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
Cross-MixSpeaker.
Network addresses limitations of SIMO models by aggregating cross-speaker representations.
Network is integrated with SOT to leverage both the advantages of SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Unified Modeling of Multi-Talker Overlapped Speech Recognition and
Diarization with a Sidecar Separator [42.8787280791491]
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization.
We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one.
We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
arXiv Detail & Related papers (2023-05-25T17:18:37Z) - Continual Learning for On-Device Speech Recognition using Disentangled
Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z) - Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what"
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number
of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR)
The proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z) - Unified Autoregressive Modeling for Joint End-to-End Multi-Talker
Overlapped Speech Recognition and Speaker Attribute Estimation [26.911867847630187]
We present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems.
We propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation.
arXiv Detail & Related papers (2021-07-04T05:47:18Z) - End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.