Related papers: Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models

Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models

URL: http://arxiv.org/abs/2401.12440v1
Date: Tue, 23 Jan 2024 02:19:31 GMT
Title: Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models
Authors: Chenyang Gao, Brecht Desplanques, Chelsea J.-T. Ju, Aman Chadha, Andreas Stolcke
Abstract summary: We propose a lightweight neural network to map the embeddings from two independent models to a shared speaker embedding space. Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities.
Score: 18.50444234955465
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtime, such as different computation and latency constraints, several applications would benefit from an asymmetric enrollment-verification framework that uses different models for enrollment and runtime embedding generation. To support this asymmetric SID where each of the two models can be updated independently, we propose using a lightweight neural network to map the embeddings from the two independent models to a shared speaker embedding space. Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities. This proposed Neural Embedding Speaker Space Alignment (NESSA) combined with an asymmetric update of only one of the models delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach.

Related papers

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
Cross-MixSpeaker. Network addresses limitations of SIMO models by aggregating cross-speaker representations. Network is integrated with SOT to leverage both the advantages of SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator [42.8787280791491]
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization. We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one. We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
arXiv Detail & Related papers (2023-05-25T17:18:37Z)
Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks. We propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition. The t-SOT model has the advantages of less inference cost and a simpler model architecture. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) The proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation [26.911867847630187]
We present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems. We propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation.
arXiv Detail & Related papers (2021-07-04T05:47:18Z)
End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.