Improving Voice Trigger Detection with Metric Learning
- URL: http://arxiv.org/abs/2204.02455v1
- Date: Tue, 5 Apr 2022 18:59:27 GMT
- Title: Improving Voice Trigger Detection with Metric Learning
- Authors: Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen
Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho,
Saurabh Adya, Chandra Dhir, Ahmed Tewfik
- Abstract summary: We propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy.
A personalized voice trigger score is then obtained as a similarity score between the embeddings of enrollment utterances and a test utterance.
Experimental results show that the proposed approach achieves a 38% relative reduction in false rejection rate.
- Score: 15.531040328839639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice trigger detection is an important task, which enables activating a
voice assistant when a target user speaks a keyword phrase. A detector is
typically trained on speech data independent of speaker information and used
for the voice trigger detection task. However, such a speaker-independent voice
trigger detector typically suffers from performance degradation on speech from
underrepresented groups, such as accented speakers. In this work, we propose a
novel voice trigger detector that can use a small number of utterances from a
target speaker to improve detection accuracy. Our proposed model employs an
encoder-decoder architecture. While the encoder performs speaker-independent
voice trigger detection, similar to the conventional detector, the decoder
predicts a personalized embedding for each utterance. A personalized voice
trigger score is then obtained as a similarity score between the embeddings of
enrollment utterances and a test utterance. The personalized embedding allows
the model to adapt to the target speaker's speech when computing the voice
trigger score, thereby improving voice trigger detection accuracy. Experimental
results show that the proposed approach achieves a 38% relative reduction in
false rejection rate (FRR) compared to a baseline speaker-independent voice
trigger model.
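The scoring step described above can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's implementation: cosine similarity and mean-pooling of the enrollment embeddings are assumptions (the abstract only says the score is a similarity between enrollment and test embeddings), and `decoder_embed` is a hypothetical stand-in for the paper's decoder.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def personalized_trigger_score(enrollment_embeddings: list[np.ndarray],
                               test_embedding: np.ndarray) -> float:
    """Score a test utterance against a target speaker's enrollment set.

    Mean-pooling the enrollment embeddings into a single speaker profile
    is an assumption; the abstract does not specify how the enrollment
    embeddings are aggregated.
    """
    profile = np.mean(np.stack(enrollment_embeddings), axis=0)
    return cosine_similarity(profile, test_embedding)

# Hypothetical usage, where decoder_embed() stands in for the paper's
# decoder that maps an utterance to its personalized embedding:
#   enroll = [decoder_embed(u) for u in enrollment_utterances]
#   score = personalized_trigger_score(enroll, decoder_embed(test_utterance))
#   triggered = score > threshold  # threshold set for a target FRR/FA trade-off
```

In the full system this personalized score would sit alongside the encoder's speaker-independent trigger score; how the two are combined is not specified in the abstract.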
Related papers
- Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
Our analysis reveals that speaker verification performance is only loosely related to TS task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
arXiv Detail & Related papers (2024-10-15T03:58:13Z)
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multi-speaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS [36.023566245506046]
We propose a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech.
The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space.
Experimental results indicate that the proposed method can achieve performance comparable to the conventional one in objective and subjective evaluations.
arXiv Detail & Related papers (2022-06-21T11:08:05Z)
- Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE [8.144263449781967]
A variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings.
In this work, we found a suitable location in the VAE's decoder to add a self-attention layer for incorporating non-local information when generating a converted utterance.
arXiv Detail & Related papers (2022-03-30T03:52:42Z)
- Personalized Keyphrase Detection using Speaker and Environment Information [24.766475943042202]
We introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary.
The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model.
arXiv Detail & Related papers (2021-04-28T18:50:19Z)
- Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system.
arXiv Detail & Related papers (2020-11-09T19:22:05Z)
- Knowledge Transfer for Efficient On-device False Trigger Mitigation [17.53768388104929]
An undirected utterance is termed a "false trigger," and false trigger mitigation (FTM) is essential for designing a privacy-centric smart assistant.
We propose an LSTM-based FTM architecture which determines the user intent directly from acoustic features, without explicitly generating ASR transcripts.
arXiv Detail & Related papers (2020-10-20T20:01:44Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract their voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that the proposed strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.