Multimodal Data and Resource Efficient Device-Directed Speech Detection
with Large Foundation Models
- URL: http://arxiv.org/abs/2312.03632v1
- Date: Wed, 6 Dec 2023 17:29:03 GMT
- Title: Multimodal Data and Resource Efficient Device-Directed Speech Detection
with Large Foundation Models
- Authors: Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis
Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
- Abstract summary: We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
- Score: 43.155061160275196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactions with virtual assistants typically start with a trigger phrase
followed by a command. In this work, we explore the possibility of making these
interactions more natural by eliminating the need for a trigger phrase. Our
goal is to determine whether a user addressed the virtual assistant based on
signals obtained from the streaming audio recorded by the device microphone. We
address this task by combining 1-best hypotheses and decoder signals from an
automatic speech recognition system with acoustic representations from an audio
encoder as input features to a large language model (LLM). In particular, we
are interested in data- and resource-efficient systems that require only a small
amount of training data and can operate in scenarios with only a single frozen
LLM available on a device. For this reason, our model is trained on 80k or fewer
examples of multimodal data using a combination of low-rank adaptation and
prefix tuning. We compare the proposed system to unimodal baselines and show
that the multimodal approach achieves lower equal-error-rates (EERs), while
using only a fraction of the training data. We also show that low-dimensional
specialized audio representations lead to lower EERs than high-dimensional
general audio representations.
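The setup described in the abstract (ASR 1-best text, decoder signals, and audio-encoder features fed to a single frozen LLM, with only a learned prefix and low-rank adapters trained) can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed module names, dimensions, and a HuggingFace-style `inputs_embeds` interface; the projection layers, classifier head, and LoRA placement are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: audio-encoder features and ASR decoder signals are projected
# into the embedding space of a single frozen LLM, prepended together with a
# learned soft prefix (prefix tuning) to the 1-best hypothesis embeddings, and
# a small binary head decides whether the utterance is device-directed.
# All names, dimensions, and the LoRA placement are assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update."""

    def __init__(self, frozen: nn.Linear, rank: int = 8):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():
            p.requires_grad_(False)
        self.down = nn.Linear(frozen.in_features, rank, bias=False)
        self.up = nn.Linear(rank, frozen.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.frozen(x) + self.up(self.down(x))


class DirectednessClassifier(nn.Module):
    def __init__(self, frozen_llm: nn.Module, d_model: int,
                 d_audio: int, d_decoder: int, prefix_len: int = 16):
        super().__init__()
        self.llm = frozen_llm  # single frozen LLM available on device
        for p in self.llm.parameters():
            p.requires_grad_(False)
        # Prefix tuning: a learned soft prompt prepended to every input.
        self.prefix = nn.Parameter(0.02 * torch.randn(prefix_len, d_model))
        # Project audio-encoder features and ASR decoder signals into LLM space.
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.decoder_proj = nn.Linear(d_decoder, d_model)
        # Binary head: device-directed vs. not (scores used to compute EER).
        self.head = nn.Linear(d_model, 1)

    def forward(self, text_embeds, audio_feats, decoder_signals):
        # text_embeds:     (B, T, d_model) embeddings of the 1-best hypothesis
        # audio_feats:     (B, A, d_audio) audio-encoder representations
        # decoder_signals: (B, D, d_decoder) ASR decoder signals
        batch = text_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        inputs = torch.cat([prefix,
                            self.audio_proj(audio_feats),
                            self.decoder_proj(decoder_signals),
                            text_embeds], dim=1)
        # Assumes a HuggingFace-style LLM that accepts `inputs_embeds` and
        # returns hidden states; adapt to the actual on-device model.
        hidden = self.llm(inputs_embeds=inputs).last_hidden_state
        return self.head(hidden[:, -1])  # logit: "addressed to the assistant"
```

Under this sketch, only the soft prefix, the two projections, the binary head, and any LoRALinear adapters wrapped around the LLM's internal projections would receive gradients; the LLM weights stay frozen, which keeps the trained parameter count small enough for the roughly 80k-example regime the abstract describes.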
Related papers
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- A Multimodal Approach to Device-Directed Speech Detection with Large Language Models [41.37311266840156]
We explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase.
We train classifiers using only acoustic information obtained from the audio waveform.
We take the decoder outputs of an automatic speech recognition system, such as 1-best hypotheses, as input features to a large language model.
arXiv Detail & Related papers (2024-03-21T14:44:03Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multimodal clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality.
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
arXiv Detail & Related papers (2023-05-31T20:09:50Z)
- Introducing Model Inversion Attacks on Automatic Speaker Recognition [0.9558392439655015]
Model inversion (MI) attacks allow the reconstruction of average per-class representations of a machine learning (ML) model's training data.
We present an approach to (1) reconstruct audio samples from a trained ML model and (2) extract intermediate voice feature representations which provide valuable insights into the speakers' biometrics.
Our sliding MI extends standard MI by iteratively inverting overlapping chunks of the audio samples.
We show that one can use the inverted audio data to generate spoofed audio samples to impersonate a speaker, and execute voice-protected commands for highly secured systems.
arXiv Detail & Related papers (2023-01-09T08:51:15Z)
- Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone [43.77139614544301]
Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we extensively investigate a two-step approach where we first pre-train a serialized output training (SOT)-based multi-talker ASR model.
With fine-tuning on the 70 hours of the AMI-SDM training data, our SOT ASR model achieves a word error rate (WER) of 21.2% for the AMI-SDM evaluation set.
arXiv Detail & Related papers (2021-03-31T02:43:32Z)