Modality Dropout for Multimodal Device Directed Speech Detection using
Verbal and Non-Verbal Features
- URL: http://arxiv.org/abs/2310.15261v1
- Date: Mon, 23 Oct 2023 18:09:31 GMT
- Title: Modality Dropout for Multimodal Device Directed Speech Detection using
Verbal and Non-Verbal Features
- Authors: Gautam Krishna, Sameer Dharur, Oggi Rudovic, Pranay Dighe, Saurabh
Adya, Ahmed Hussen Abdelaziz, Ahmed H Tewfik
- Abstract summary: We study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for device-directed speech detection (DDSD).
We present different approaches to combine scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves performance by up to 8.5% in terms of false acceptance rate (FA).
Our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when evaluated with missing modalities at inference time.
- Score: 11.212228410835435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Device-directed speech detection (DDSD) is the binary classification task of
distinguishing between queries directed at a voice assistant versus side
conversation or background speech. State-of-the-art DDSD systems use verbal
cues, e.g., acoustic, text, and/or automatic speech recognition system (ASR)
features, to classify speech as device-directed or otherwise, and often have to
contend with one or more of these modalities being unavailable when deployed in
real-world settings. In this paper, we investigate fusion schemes for DDSD
systems that can be made more robust to missing modalities. Concurrently, we
study the use of non-verbal cues, specifically prosody features, in addition to
verbal cues for DDSD. We present different approaches to combine scores and
embeddings from prosody with the corresponding verbal cues, finding that
prosody improves DDSD performance by up to 8.5% in terms of false acceptance
rate (FA) at a given fixed operating point via non-linear intermediate fusion,
while our use of modality dropout techniques improves the performance of these
models by 7.4% in terms of FA when evaluated with missing modalities at
inference time.
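A minimal sketch of how modality dropout can be realized in a non-linear intermediate-fusion DDSD model; the module names, embedding dimensions, and dropout rate below are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ModalityDropoutFusion(nn.Module):
    """Non-linear intermediate fusion of verbal and prosody embeddings
    with modality dropout (dimensions and p are assumed, not the paper's)."""

    def __init__(self, dims=None, p=0.3):
        super().__init__()
        self.dims = dims or {"acoustic": 256, "text": 128, "prosody": 64}
        self.p = p
        self.fuse = nn.Sequential(
            nn.Linear(sum(self.dims.values()), 256), nn.ReLU(),
            nn.Linear(256, 1),  # device-directed vs. not (logit)
        )

    def forward(self, embs):
        """embs: dict modality -> (B, D) tensor; pass None for a missing one."""
        ref = next(e for e in embs.values() if e is not None)
        parts = []
        for name, dim in self.dims.items():
            e = embs.get(name)
            if e is None:  # modality unavailable at inference
                e = ref.new_zeros(ref.size(0), dim)
            elif self.training and torch.rand(()) < self.p:
                e = torch.zeros_like(e)  # modality dropout during training
            parts.append(e)
        return self.fuse(torch.cat(parts, dim=-1))

model = ModalityDropoutFusion()
out = model({"acoustic": torch.randn(4, 256),
             "text": None,  # e.g. ASR text missing on-device
             "prosody": torch.randn(4, 64)})
```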
Related papers
- End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients [6.626696929949397]
We propose to use shifted delta coefficients (SDC), which help capture pronunciation variability.
The proposed approach demonstrates superior performance compared to state-of-the-art UDKWS techniques.
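For illustration, a small sketch of how shifted delta coefficients are conventionally computed from cepstral features; the N-d-P-k parameterization shown follows the classic 7-1-3-7 convention and is not necessarily this paper's setup:

```python
import numpy as np

def shifted_delta_coefficients(c, d=1, P=3, k=7):
    """Compute SDC features from cepstra c of shape (T, N).

    For each frame t, stacks k delta blocks taken at shifts of P frames:
    SDC(t) = [delta(t), delta(t+P), ..., delta(t+(k-1)P)],
    where delta(t) = c(t+d) - c(t-d) (edges clamped).
    The classic configuration is N-d-P-k = 7-1-3-7.
    """
    T, _ = c.shape
    idx = np.arange(T)
    # simple-difference deltas with edge clamping
    delta = c[np.clip(idx + d, 0, T - 1)] - c[np.clip(idx - d, 0, T - 1)]
    blocks = [delta[np.clip(idx + i * P, 0, T - 1)] for i in range(k)]
    return np.concatenate(blocks, axis=1)  # (T, N*k)
```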
arXiv Detail & Related papers (2024-05-23T12:24:01Z)
- End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations [13.020158123538138]
Speech separation guided diarization (SSGD) performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream.
We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in online and offline scenarios.
We show that our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model.
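A minimal sketch of the SSGD pipeline under stated assumptions: `separate` stands in for any speech separation model (e.g. a Conv-TasNet), and the per-stream VAD here is a simple frame-energy detector rather than the paper's neural VAD:

```python
import numpy as np

def ssgd_diarize(mixture, separate, frame=400, hop=160, thresh=1e-3):
    """Speech-separation-guided diarization: first separate the speakers,
    then run VAD independently on each separated stream."""
    segments = []
    for spk, stream in enumerate(separate(mixture)):
        n = 1 + max(0, (len(stream) - frame) // hop)
        for f in range(n):
            chunk = stream[f * hop : f * hop + frame]
            if np.mean(chunk ** 2) > thresh:  # frame is speech-active
                segments.append((spk, f * hop, f * hop + frame))
    return segments
```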
arXiv Detail & Related papers (2023-03-21T16:33:56Z)
- Rethinking Audio-visual Synchronization for Active Speaker Detection [62.95962896690992]
Existing research on active speaker detection (ASD) does not agree on the definition of active speakers.
We propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue.
Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
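As a hedged illustration, one common form of cross-modal contrastive learning pairs synchronized audio and visual embeddings against mismatched ones in the batch; this InfoNCE-style loss is a sketch, not the authors' exact objective:

```python
import torch
import torch.nn.functional as F

def av_sync_contrastive_loss(audio_emb, video_emb, tau=0.07):
    """Pull together audio/video embeddings from the same synchronized
    clip; push apart all mismatched pairs. Inputs: (B, D) tensors."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / tau                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric: audio->video and video->audio directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```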
arXiv Detail & Related papers (2022-06-21T14:19:06Z)
- Improved far-field speech recognition using Joint Variational Autoencoder [5.320201231911981]
We propose mapping speech features from far-field to close-talk using a denoising autoencoder (DA).
Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA-based enhancement and 3.96% compared to an acoustic model (AM) trained directly on far-field filterbank features.
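A minimal sketch of the DA-based mapping idea: a plain denoising autoencoder regressing close-talk features from parallel far-field frames. The joint VAE of the paper adds more structure than shown here, and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class FeatureDenoisingAutoencoder(nn.Module):
    """Maps far-field filterbank frames to close-talk targets."""
    def __init__(self, dim=40, hidden=512, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, bottleneck), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(bottleneck, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dim))
    def forward(self, far_field):
        return self.decoder(self.encoder(far_field))

# one training step: regress close-talk features from parallel far-field frames
model = FeatureDenoisingAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
far, close = torch.randn(32, 40), torch.randn(32, 40)  # stand-in batch
opt.zero_grad()
loss = nn.functional.mse_loss(model(far), close)
loss.backward()
opt.step()
```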
arXiv Detail & Related papers (2022-04-24T14:14:04Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
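A toy block in the spirit of splitting message passing by context source, using two separate adjacency matrices (temporal edges vs. cross-speaker edges); this is an assumption-laden sketch, not the iGNN implementation:

```python
import torch
import torch.nn as nn

class SplitContextGNNBlock(nn.Module):
    """Messages are passed separately over temporal edges (same speaker,
    adjacent frames) and relational edges (different speakers, same
    frame), then aggregated."""
    def __init__(self, dim):
        super().__init__()
        self.temporal = nn.Linear(dim, dim)
        self.relational = nn.Linear(dim, dim)
    def forward(self, x, adj_time, adj_spk):
        # x: (N, dim) node features; adj_*: (N, N) normalized adjacency
        return torch.relu(adj_time @ self.temporal(x) +
                          adj_spk @ self.relational(x))
```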
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The inversion model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
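A minimal sketch of the A2A inversion idea under stated assumptions: a regression network maps acoustic frames to articulatory trajectories; the network shape and the 12 articulatory dimensions are illustrative:

```python
import torch
import torch.nn as nn

# A2A inversion sketch: regress articulatory trajectories from acoustic
# frames, trained on parallel data (e.g. TORGO), then used to generate
# articulatory features for corpora without such recordings.
a2a = nn.Sequential(nn.Linear(40, 512), nn.ReLU(),
                    nn.Linear(512, 512), nn.ReLU(),
                    nn.Linear(512, 12))          # 12 articulatory dims (assumed)
acoustic = torch.randn(8, 40)                    # a batch of filterbank frames
loss = nn.functional.mse_loss(a2a(acoustic), torch.randn(8, 12))
```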
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
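For illustration, a Kaldi-style speed perturbation sketch using polyphase resampling; the 0.9/1.0/1.1 factors are the conventional choice and assumed here:

```python
import numpy as np
from fractions import Fraction
from scipy.signal import resample_poly

def speed_perturb(wave, factor):
    """Change playback speed by `factor` (e.g. 0.9, 1.1): both duration
    and pitch shift, implemented as resampling by 1/factor."""
    frac = Fraction(factor).limit_denominator(100)
    # resample_poly(x, up, down) rescales length by up/down = 1/factor
    return resample_poly(wave, frac.denominator, frac.numerator)

wave = np.random.randn(16000)  # stand-in 1-second utterance
augmented = [speed_perturb(wave, f) for f in (0.9, 1.0, 1.1)]
```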
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings [18.282632348274756]
Phonetic embeddings, extracted from ASR models trained with large amounts of word-level annotations, can serve as a good representation of the content of input speech.
We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly for building a more powerful MD&D system.
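A hedged sketch of joint APL usage: frame-aligned acoustic, phonetic, and linguistic embeddings are concatenated before a classifier head (all dimensions and the head architecture are assumptions):

```python
import torch
import torch.nn as nn

class APLFusion(nn.Module):
    """Concatenate frame-aligned Acoustic, Phonetic and Linguistic (APL)
    embeddings and feed them to a mispronunciation detection/diagnosis
    classifier predicting per-frame phone labels."""
    def __init__(self, d_a, d_p, d_l, n_phones):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(d_a + d_p + d_l, 256), nn.ReLU(),
                                  nn.Linear(256, n_phones))
    def forward(self, acoustic, phonetic, linguistic):
        return self.head(torch.cat([acoustic, phonetic, linguistic], dim=-1))

fusion = APLFusion(d_a=128, d_p=64, d_l=64, n_phones=40)
logits = fusion(torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 64))
```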
arXiv Detail & Related papers (2021-10-14T11:25:02Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieved a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
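As an illustration, SentencePiece exposes BPE-dropout through its sampling-enabled encoding; the model file name and alpha value below are assumptions:

```python
import sentencepiece as spm

# assumes a BPE SentencePiece model trained beforehand, e.g.:
# spm.SentencePieceTrainer.train(input="train.txt", model_prefix="bpe",
#                                vocab_size=4000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="bpe.model")

# deterministic segmentation (inference):
print(sp.encode("merhaba dunya", out_type=str))
# BPE-dropout (training): alpha is the probability of dropping a merge,
# yielding a different acoustic-unit sequence for the same text each epoch
for _ in range(3):
    print(sp.encode("merhaba dunya", out_type=str,
                    enable_sampling=True, alpha=0.1))
```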
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
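A minimal sketch of self-adaptation, assuming the speaker representation is pooled directly from the test utterance itself (no enrollment) and broadcast to condition a masking enhancer; the architecture is illustrative:

```python
import torch
import torch.nn as nn

class SelfAdaptiveEnhancer(nn.Module):
    """A speaker embedding is pooled from the noisy test utterance and
    concatenated to every frame as an auxiliary enhancement input."""
    def __init__(self, dim=257, spk_dim=128):
        super().__init__()
        self.spk_encoder = nn.GRU(dim, spk_dim, batch_first=True)
        self.enhancer = nn.Sequential(nn.Linear(dim + spk_dim, 512), nn.ReLU(),
                                      nn.Linear(512, dim), nn.Sigmoid())
    def forward(self, spec):                     # spec: (B, T, dim)
        h, _ = self.spk_encoder(spec)
        spk = h.mean(dim=1, keepdim=True)        # utterance-level pooling
        cond = spk.expand(-1, spec.size(1), -1)  # broadcast over frames
        mask = self.enhancer(torch.cat([spec, cond], dim=-1))
        return spec * mask                       # masked magnitude spectrum

enhanced = SelfAdaptiveEnhancer()(torch.rand(2, 100, 257))
```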
arXiv Detail & Related papers (2020-02-14T05:05:36Z)