Multi-user VoiceFilter-Lite via Attentive Speaker Embedding
- URL: http://arxiv.org/abs/2107.01201v1
- Date: Fri, 2 Jul 2021 17:45:37 GMT
- Title: Multi-user VoiceFilter-Lite via Attentive Speaker Embedding
- Authors: Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw
- Abstract summary: We propose a solution to allow speaker-conditioned speech models to support an arbitrary number of enrolled users in a single pass.
This is achieved by using an attention mechanism on multiple speaker embeddings to compute a single attentive embedding.
With up to four enrolled users, multi-user VoiceFilter-Lite is able to significantly reduce speech recognition and speaker verification errors.
- Score: 11.321747759474164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a solution to allow speaker-conditioned speech
models, such as VoiceFilter-Lite, to support an arbitrary number of enrolled
users in a single pass. This is achieved by using an attention mechanism on
multiple speaker embeddings to compute a single attentive embedding, which is
then used as a side input to the model. We implemented multi-user
VoiceFilter-Lite and evaluated it for three tasks: (1) a streaming automatic
speech recognition (ASR) task; (2) a text-independent speaker verification
task; and (3) a personalized keyphrase detection task, where ASR has to detect
keyphrases from multiple enrolled users in a noisy environment. Our experiments
show that, with up to four enrolled users, multi-user VoiceFilter-Lite is able
to significantly reduce speech recognition and speaker verification errors when
there is overlapping speech, without affecting performance under other acoustic
conditions. This attentive speaker embedding approach can also be easily
applied to other speaker-conditioned models such as personal VAD and
personalized ASR.
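The mechanism described in the abstract is compact enough to sketch. Below is a minimal, hypothetical PyTorch sketch of the attentive-embedding idea: attention weights over the enrolled d-vectors are computed from the acoustic features, and their weighted sum becomes the single side-input embedding. The module name, layer choices, and dimensions are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch (assumed, not the paper's exact implementation) of computing a
# single attentive speaker embedding from multiple enrolled d-vectors.
import torch
import torch.nn as nn

class AttentiveSpeakerEmbedding(nn.Module):
    def __init__(self, frame_dim: int, emb_dim: int):
        super().__init__()
        # Project audio frames into the d-vector space to score each speaker.
        self.query_proj = nn.Linear(frame_dim, emb_dim)

    def forward(self, frames: torch.Tensor, speaker_embs: torch.Tensor) -> torch.Tensor:
        """
        frames:       (batch, time, frame_dim) acoustic features
        speaker_embs: (batch, num_enrolled, emb_dim) enrolled d-vectors
        returns:      (batch, time, emb_dim) one attentive embedding per frame
        """
        query = self.query_proj(frames)                        # (B, T, E)
        scores = torch.einsum("bte,bke->btk", query, speaker_embs)
        weights = torch.softmax(scores, dim=-1)                # attend over enrolled users
        return torch.einsum("btk,bke->bte", weights, speaker_embs)

# Usage: the attentive embedding is consumed as a side input to the
# speaker-conditioned model, exactly as a single d-vector would be.
B, T, K = 2, 100, 4
frames = torch.randn(B, T, 128)                # acoustic feature frames
dvecs = torch.randn(B, K, 256)                 # four enrolled users
attentive = AttentiveSpeakerEmbedding(128, 256)(frames, dvecs)  # (2, 100, 256)
```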
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embeddings of multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
arXiv Detail & Related papers (2023-12-18T06:29:53Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion (see the serialization sketch after this list).
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Closing the Gap between Single-User and Multi-User VoiceFilter-Lite [13.593557171761782]
VoiceFilter-Lite is a speaker-conditioned voice separation model.
It plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers.
In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model.
We successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations.
arXiv Detail & Related papers (2022-02-24T16:10:16Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on LibriSpeechMix, a multi-talker dataset derived from LibriSpeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- A Machine of Few Words -- Interactive Speaker Recognition with Reinforcement Learning [35.36769027019856]
We present a new paradigm for automatic speaker recognition that we call Interactive Speaker Recognition (ISR).
In this paradigm, the recognition system aims to incrementally build a representation of the speakers by requesting personalized utterances.
We show that our method achieves excellent performance while using only small amounts of speech.
arXiv Detail & Related papers (2020-08-07T12:44:08Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
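As referenced in the t-SOT entry above, here is a hedged illustration of token-level serialized output training: tokens from overlapping utterances are serialized into a single stream in chronological order, with a special channel-change token marking every switch between speakers. The token name "<cc>", the timings, and the helper function below are assumptions for this sketch, not code from the cited paper.

```python
# Hedged illustration of t-SOT-style serialization for two overlapping speakers.
CC = "<cc>"  # channel-change token (name assumed for this sketch)

def serialize_tsot(utterances):
    """utterances: list of (start_time, channel, word) tuples from overlapping
    speakers. Returns one serialized t-SOT token stream."""
    stream, current_channel = [], None
    for _, channel, word in sorted(utterances):  # chronological order
        if current_channel is not None and channel != current_channel:
            stream.append(CC)  # mark the switch to the other speaker
        stream.append(word)
        current_channel = channel
    return stream

# Two overlapping utterances: "how are you" (spk A) and "fine thanks" (spk B).
mixed = [(0.0, "A", "how"), (0.4, "A", "are"), (0.5, "B", "fine"),
         (0.8, "A", "you"), (0.9, "B", "thanks")]
print(serialize_tsot(mixed))
# ['how', 'are', '<cc>', 'fine', '<cc>', 'you', '<cc>', 'thanks']
```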