Efficient Multimodal Neural Networks for Trigger-less Voice Assistants
- URL: http://arxiv.org/abs/2305.12063v1
- Date: Sat, 20 May 2023 02:52:02 GMT
- Title: Efficient Multimodal Neural Networks for Trigger-less Voice Assistants
- Authors: Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Tashweena Heeramun, Karan
Sawnhey, Ed Yanosik, Saravana Rathinam, Saurabh Adya
- Abstract summary: We propose a neural network-based audio-gesture multimodal fusion system for smartwatches.
The system better understands temporal correlation between audio and gesture data, leading to precise invocations.
It is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times.
- Score: 0.8209843760716959
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The adoption of multimodal interactions by Voice Assistants (VAs) is growing
rapidly to enhance human-computer interactions. Smartwatches have now
incorporated trigger-less methods of invoking VAs, such as Raise To Speak
(RTS), where the user raises their watch and speaks to VAs without an explicit
trigger. Current state-of-the-art RTS systems rely on heuristics and engineered
Finite State Machines to fuse gesture and audio data for multimodal
decision-making. However, these methods have limitations, including limited
adaptability, scalability, and induced human biases. In this work, we propose a
neural network-based audio-gesture multimodal fusion system that (1) better
understands temporal correlation between audio and gesture data, leading to
precise invocations, (2) generalizes to a wide range of environments and
scenarios, (3) is lightweight and deployable on low-power devices, such as
smartwatches, with quick launch times, and (4) improves productivity in asset
development processes.
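The abstract does not spell out the architecture, but the kind of lightweight, temporally aware audio-gesture fusion it describes could be sketched as follows. This is a minimal sketch: the feature dimensions, the GRU encoders, and the single-score head are illustrative assumptions, not the authors' design.

```python
# Minimal sketch (assumed: 64-dim audio frames, 6-axis IMU gesture frames,
# GRU-based temporal fusion; none of these specifics come from the paper).
import torch
import torch.nn as nn

class AudioGestureFusion(nn.Module):
    """Lightweight audio-gesture fusion for an RTS-style invocation decision."""

    def __init__(self, audio_dim=64, gesture_dim=6, hidden=32):
        super().__init__()
        # Small per-modality encoders keep the parameter count watch-friendly.
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.gesture_enc = nn.GRU(gesture_dim, hidden, batch_first=True)
        # Temporal fusion over concatenated per-step features lets the model
        # learn how the raise gesture aligns in time with the onset of speech.
        self.fusion = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # invoke / don't invoke

    def forward(self, audio, gesture):
        # audio: (B, T, audio_dim), gesture: (B, T, gesture_dim), time-aligned.
        a, _ = self.audio_enc(audio)
        g, _ = self.gesture_enc(gesture)
        fused, _ = self.fusion(torch.cat([a, g], dim=-1))
        return torch.sigmoid(self.head(fused[:, -1]))  # decision at the last step

model = AudioGestureFusion()
score = model(torch.randn(2, 100, 64), torch.randn(2, 100, 6))  # (2, 1) scores
```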
Related papers
- Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction [110.38946048535033]
This paper introduces Step-Audio, the first production-ready open-source solution for intelligent speech interaction.
Key contributions include:
1) a unified speech-text multimodal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced;
2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation;
3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP;
4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex interactions.
arXiv Detail & Related papers (2025-02-17T15:58:56Z)
- Baichuan-Omni-1.5 Technical Report [78.49101296394218]
Baichuan-Omni-1.5 is an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities.
First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B of high-quality data.
Second, an audio tokenizer has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLMs.
arXiv Detail & Related papers (2025-01-26T02:19:03Z)
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition [57.131546757903834]
Lyra is an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction.
Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.
arXiv Detail & Related papers (2024-12-12T17:50:39Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- A Multimodal Approach to Device-Directed Speech Detection with Large Language Models [41.37311266840156]
We explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase.
We train classifiers using only acoustic information obtained from the audio waveform.
We also take the decoder outputs of an automatic speech recognition system, such as 1-best hypotheses, as input features to a large language model.
arXiv Detail & Related papers (2024-03-21T14:44:03Z)
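As a rough illustration of the idea in the entry above (an ASR 1-best hypothesis scored by a language model for device-directedness), the sketch below stands in a small off-the-shelf text classifier for the LLM; the model name, two-label head, and example utterance are placeholders, not the authors' setup.

```python
# Hedged sketch: a small text classifier standing in for the LLM in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # device-directed vs. not
)

def is_device_directed(asr_1best: str) -> float:
    """Score an ASR 1-best hypothesis as device-directed speech."""
    inputs = tokenizer(asr_1best, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(is_device_directed("set a timer for ten minutes"))
```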
- Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
arXiv Detail & Related papers (2023-12-06T17:29:03Z)
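The entry above combines ASR 1-best hypotheses and decoder signals with acoustic representations from an audio encoder. A minimal late-fusion sketch of that combination follows; the feature dimensions are assumptions, and a small MLP stands in for the large foundation model used in the paper.

```python
# Hedged sketch of late fusion of ASR-derived features with audio-encoder
# features; dimensions and the MLP head are illustrative assumptions.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, text_dim=256, audio_dim=512, n_decoder_signals=4):
        super().__init__()
        in_dim = text_dim + audio_dim + n_decoder_signals
        self.mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, text_emb, audio_emb, decoder_signals):
        # text_emb: embedding of the 1-best hypothesis; audio_emb: pooled audio
        # encoder output; decoder_signals: e.g. per-utterance ASR confidences.
        x = torch.cat([text_emb, audio_emb, decoder_signals], dim=-1)
        return torch.sigmoid(self.mlp(x))  # P(addressed to the assistant)

clf = FusionClassifier()
p = clf(torch.randn(8, 256), torch.randn(8, 512), torch.rand(8, 4))  # (8, 1)
```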
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
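A hedged sketch of the audio-guided cross-modal attention idea behind the CMFE entry above: audio frames act as queries over lip-region visual features. The dimensions, single layer, and residual/LayerNorm wiring are assumptions, not the paper's exact design.

```python
# Hedged sketch of one audio-guided cross-modal attention layer.
import torch
import torch.nn as nn

class AudioGuidedFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # Audio frames are queries, visual features are keys/values, so audio
        # "guides" which visual frames contribute at each time step.
        attended, _ = self.cross_attn(audio_feats, visual_feats, visual_feats)
        return self.norm(audio_feats + attended)

layer = AudioGuidedFusionLayer()
out = layer(torch.randn(2, 120, 256), torch.randn(2, 30, 256))  # (2, 120, 256)
```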
- AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization [14.103742565510387]
We introduce AVE-CLIP, a novel framework that integrates AudioCLIP, pre-trained on large-scale audio-visual data, with a multi-window temporal transformer.
Our method achieves state-of-the-art performance on the publicly available AVE dataset with 5.9% mean accuracy improvement.
arXiv Detail & Related papers (2022-10-11T00:15:45Z)
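For the AVE-CLIP entry above, one way to read "multi-window temporal transformer" is self-attention applied within temporal windows of several sizes whose outputs are then combined; the sketch below illustrates that reading only, with window sizes, dimensions, and the mean aggregation all assumed rather than taken from the paper.

```python
# Hedged sketch of multi-window temporal self-attention (assumed reading).
import torch
import torch.nn as nn

class MultiWindowTemporal(nn.Module):
    def __init__(self, dim=256, heads=4, windows=(2, 5, 10)):
        super().__init__()
        self.windows = windows
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in windows
        )

    def forward(self, x):  # x: (B, T, dim) fused audio-visual features
        B, T, D = x.shape
        outs = []
        for w, block in zip(self.windows, self.blocks):
            assert T % w == 0, "for simplicity, T must be divisible by each window"
            chunks = x.reshape(B * (T // w), w, D)   # attend within each window
            outs.append(block(chunks).reshape(B, T, D))
        return torch.stack(outs).mean(dim=0)          # combine window scales

mwt = MultiWindowTemporal()
y = mwt(torch.randn(2, 10, 256))  # (2, 10, 256)
```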
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
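The BPE-dropout-style augmentation described in the last entry can be approximated with off-the-shelf subword tooling; the sketch below uses SentencePiece's sampling mode, with the model file and sampling parameters as placeholders rather than the paper's settings.

```python
# Hedged sketch of BPE-dropout-style subword sampling with SentencePiece.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # hypothetical model file

text = "dynamic acoustic unit augmentation"
# Deterministic segmentation (standard BPE):
print(sp.encode(text, out_type=str))
# Stochastic segmentations: each call may split the same words differently,
# exposing the ASR model to a richer set of acoustic units during training.
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
```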