Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles
- URL: http://arxiv.org/abs/2310.11379v1
- Date: Tue, 17 Oct 2023 16:22:18 GMT
- Title: Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles
- Authors: Fernando López, Jordi Luque, Carlos Segura, Pablo Gómez
- Abstract summary: It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server side.
To protect privacy, audio features are sent to the cloud instead of raw audio.
- Score: 48.208214762257136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice-based interfaces rely on a wake-up word mechanism to initiate
communication with devices. However, achieving robust, energy-efficient, and
fast detection remains a challenge. This paper addresses these real production
needs by enhancing the data with temporal alignments and by using a two-stage,
multi-resolution detection scheme. It employs two models: a lightweight
on-device model for real-time processing of the audio stream and a verification
model on the server side, which is an ensemble of heterogeneous architectures
that refine detection. This scheme allows the optimization of two operating
points. To protect privacy, audio features are sent to the cloud instead of raw
audio. The study investigated different parametric configurations for feature
extraction to select one for on-device detection and another for the
verification model. Furthermore, thirteen different audio classifiers were
compared in terms of performance and inference time. The proposed ensemble
outperforms our strongest single classifier in every noise condition.
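To make the two-stage flow concrete, here is a minimal Python sketch of the pipeline described in the abstract. Everything in it is an illustrative assumption rather than the paper's implementation: the class names, the toy band-energy features (standing in for the two tuned parametric feature configurations), and the two thresholds (standing in for the two operating points the scheme lets one optimize).

```python
import numpy as np

ON_DEVICE_THRESHOLD = 0.5  # assumed operating point of the on-device stage
SERVER_THRESHOLD = 0.8     # assumed operating point of the verification stage

def extract_features(frame: np.ndarray, n_bands: int) -> np.ndarray:
    """Toy parametric front-end: log-compressed band energies.

    Stands in for the feature-extraction configurations the paper tunes:
    one for on-device detection, another (higher-resolution) for the
    server-side verification model.
    """
    spectrum = np.abs(np.fft.rfft(frame))
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([band.mean() for band in bands]))

class ToyClassifier:
    """Stand-in for one of the audio classifiers compared in the paper."""
    def __init__(self, dim: int, seed: int):
        self.w = np.random.default_rng(seed).normal(size=dim)

    def score(self, feats: np.ndarray) -> float:
        return float(1.0 / (1.0 + np.exp(-self.w @ feats)))

class ServerEnsemble:
    """Verification stage: fuses scores of heterogeneous models."""
    def __init__(self, models):
        self.models = models

    def score(self, feats: np.ndarray) -> float:
        return float(np.mean([m.score(feats) for m in self.models]))

def process_stream(frames, device_model, ensemble):
    """Yield frames that both stages accept as the wake-up word."""
    for frame in frames:
        # Stage 1: cheap on-device detection over the live audio stream.
        device_feats = extract_features(frame, n_bands=40)
        if device_model.score(device_feats) < ON_DEVICE_THRESHOLD:
            continue
        # Stage 2: features (never raw audio) are computed under the
        # server-side configuration and sent to the ensemble verifier.
        server_feats = extract_features(frame, n_bands=80)
        if ensemble.score(server_feats) >= SERVER_THRESHOLD:
            yield frame

# Illustrative usage on synthetic frames (25 ms at 16 kHz).
rng = np.random.default_rng(1)
frames = [rng.normal(size=400) for _ in range(10)]
device = ToyClassifier(dim=40, seed=0)
ensemble = ServerEnsemble([ToyClassifier(dim=80, seed=s) for s in range(3)])
detections = list(process_stream(frames, device, ensemble))
```

Score averaging is just one plausible fusion rule for the heterogeneous server-side ensemble; the paper does not commit to this particular combination.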
Related papers
- Proactive Detection of Voice Cloning with Localized Watermarking [50.13539630769929]
We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech.
AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized watermark detection up to the sample level.
AudioSeal achieves state-of-the-art performance in terms of robustness to real-life audio manipulations and imperceptibility, based on both automatic and human evaluation metrics.
arXiv Detail & Related papers (2024-01-30T18:56:22Z)
- Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter-efficient, inference-time faithful decoding algorithm that enables smaller audio captioning models to match the performance of larger models trained on more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting a spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- End-To-End Audiovisual Feature Fusion for Active Speaker Detection [7.631698269792165]
This work presents a novel two-stream end-to-end framework fusing features extracted from images via VGG-M with Mel-Frequency Cepstral Coefficient (MFCC) features extracted from the raw audio waveform.
Our best-performing model attained 88.929% accuracy, nearly matching the detection results of state-of-the-art work.
arXiv Detail & Related papers (2022-07-27T10:25:59Z)
- Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to determine the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker-related features learned from context information in the time and frequency domains.
The results show that the proposed approach, combining speech enhancement and multi-stage attention, outperforms two strong baselines without these components in most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)