End-To-End Audiovisual Feature Fusion for Active Speaker Detection
- URL: http://arxiv.org/abs/2207.13434v1
- Date: Wed, 27 Jul 2022 10:25:59 GMT
- Title: End-To-End Audiovisual Feature Fusion for Active Speaker Detection
- Authors: Fiseha B. Tesema, Zheyuan Lin, Shiqiang Zhu, Wei Song, Jason Gu, Hong Wu
- Abstract summary: This work presents a novel two-stream end-to-end framework fusing features extracted from images via VGG-M with raw Mel-Frequency Cepstral Coefficient (MFCC) features extracted from the audio waveform.
Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.
- Score: 7.631698269792165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Active speaker detection plays a vital role in human-machine interaction.
Recently, a few end-to-end audiovisual frameworks have emerged. However, the
inference times of these models were not explored, and they are not applicable
to real-time applications due to their complexity and large input size. In
addition, they all adopted a similar feature extraction strategy that employs a
ConvNet on both the audio and visual inputs. This work presents a novel
two-stream end-to-end framework fusing features extracted from images via VGG-M
with raw Mel-Frequency Cepstral Coefficient (MFCC) features extracted from the
audio waveform. Two BiGRU layers are attached to each stream to model each
stream's temporal dynamics before fusion, and one BiGRU layer is attached after
fusion to model the joint temporal dynamics. Experimental results on the
AVA-ActiveSpeaker dataset indicate that our new feature extraction strategy is
more robust to noisy signals and yields better inference time than models that
employ a ConvNet on both modalities. The proposed model predicts within 44.41
ms, which is fast enough for real-time applications. Our best-performing model
attained 88.929% accuracy, nearly the same detection result as state-of-the-art
work.
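For concreteness, the following is a minimal PyTorch sketch of the architecture the abstract describes: a VGG-M-style ConvNet on face crops, raw MFCCs on the audio side, two BiGRU layers per stream, fusion, and one joint BiGRU. The layer sizes, concatenation as the fusion operator, and the two-class head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TwoStreamASD(nn.Module):
    """Sketch of the two-stream design: a VGG-M-style ConvNet on face crops,
    raw MFCCs on the audio side, per-stream BiGRUs, then a joint BiGRU.
    All layer sizes are illustrative assumptions."""
    def __init__(self, n_mfcc=13, hidden=128):
        super().__init__()
        # Visual stream: a VGG-M-like stack reduced to one vector per frame.
        self.visual_cnn = nn.Sequential(
            nn.Conv2d(3, 96, 7, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Two BiGRU layers per stream handle each modality's temporal dynamics.
        self.visual_gru = nn.GRU(512, hidden, num_layers=2,
                                 bidirectional=True, batch_first=True)
        self.audio_gru = nn.GRU(n_mfcc, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        # One BiGRU after fusion models the joint temporal dynamics.
        self.fusion_gru = nn.GRU(4 * hidden, hidden, num_layers=1,
                                 bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 2)  # speaking / not speaking

    def forward(self, faces, mfcc):
        # faces: (B, T, 3, H, W) face crops; mfcc: (B, T, n_mfcc) features.
        B, T = faces.shape[:2]
        v = self.visual_cnn(faces.flatten(0, 1)).view(B, T, -1)
        v, _ = self.visual_gru(v)
        a, _ = self.audio_gru(mfcc)
        # Fusion by concatenation along the feature axis (an assumed choice).
        f, _ = self.fusion_gru(torch.cat([v, a], dim=-1))
        return self.classifier(f)  # per-frame logits
```

In practice the MFCC input could be computed from the raw waveform with, e.g., torchaudio.transforms.MFCC; keeping the audio stream free of a ConvNet is what the abstract credits for the fast inference.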
Related papers
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
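As a rough illustration of the DG-SCT idea above (trainable cross-modal interaction layers added to pre-trained encoders), the sketch below shows only a channel-wise gate in which each modality modulates the other; the actual module also attends over spatial and temporal axes, and this design is an assumption, not the paper's exact layer.

```python
import torch.nn as nn

class CrossModalGate(nn.Module):
    """Illustrative cross-modal interaction layer: each modality produces
    channel-wise attention weights that modulate the other modality's
    features (e.g. taken from frozen pre-trained encoders)."""
    def __init__(self, dim_a, dim_v):
        super().__init__()
        self.a2v = nn.Sequential(nn.Linear(dim_a, dim_v), nn.Sigmoid())
        self.v2a = nn.Sequential(nn.Linear(dim_v, dim_a), nn.Sigmoid())

    def forward(self, a, v):
        # a: (B, dim_a) audio features; v: (B, dim_v) visual features.
        return a * self.v2a(v), v * self.a2v(a)
```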
- Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server side.
To protect privacy, audio features are sent to the cloud instead of raw audio.
arXiv Detail & Related papers (2023-10-17T16:22:18Z)
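A hedged sketch of the two-stage cascade described above: a lightweight first-stage model screens the stream on-device, and only promising segments have their features (never raw audio) verified server-side. All names and the threshold below are assumptions.

```python
# Illustrative two-stage wake-word cascade; names/threshold are assumptions.
def detect_wake_word(frames, on_device_model, send_to_server, thresh=0.7):
    for feats in frames:                # streaming audio features
        score = on_device_model(feats)  # cheap real-time first stage
        if score > thresh:
            # Privacy: ship features, never raw audio, for verification.
            if send_to_server(feats):   # heavier server-side model
                return True
    return False
```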
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
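The summary above says RTFS-Net operates on the complex time-frequency bins yielded by the Short-Time Fourier Transform. A minimal sketch of obtaining (and inverting) that representation follows; stacking real and imaginary parts as channels is a common convention assumed here, not necessarily the paper's.

```python
import torch

# Complex time-frequency bins via the STFT.
wav = torch.randn(1, 16000)                  # 1 s of 16 kHz audio
spec = torch.stft(wav, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)
tf_bins = torch.stack([spec.real, spec.imag], dim=1)  # (B, 2, F, T)
# A separation network would transform tf_bins, then resynthesize:
wav_rec = torch.istft(spec, n_fft=512, hop_length=128,
                      window=torch.hann_window(512))
```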
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
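A minimal sketch of the co-prediction idea named above: two representations of the same sound source, one per modality, each trained to predict the other. The prediction heads and the symmetric MSE objective are assumptions, not AVPC's exact recipe.

```python
import torch.nn.functional as F

def co_prediction_loss(z_audio, z_visual, a2v_head, v2a_head):
    """Self-supervised co-prediction sketch: each modality's features predict
    the other modality's representation of the same sound source."""
    loss_av = F.mse_loss(a2v_head(z_audio), z_visual.detach())
    loss_va = F.mse_loss(v2a_head(z_visual), z_audio.detach())
    return loss_av + loss_va
```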
- STMixer: A One-Stage Sparse Action Detector [48.0614066856134]
We propose a new one-stage action detector, termed STMixer.
We present a query-based adaptive feature sampling module, which endows our STMixer with the flexibility of mining a set of discriminative video features.
We obtain state-of-the-art results on the AVA, UCF101-24, and JHMDB datasets.
arXiv Detail & Related papers (2023-03-28T10:47:06Z)
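A loose sketch of query-based adaptive feature sampling as named in the STMixer summary above: each query regresses sampling points and gathers video features there. The point-regression design below is an assumption for illustration only.

```python
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSampler(nn.Module):
    """Illustrative query-based adaptive sampling: each query regresses
    normalized (x, y) locations and gathers features with grid_sample."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.to_points = nn.Linear(dim, n_points * 2)

    def forward(self, queries, feat_map):
        # queries: (B, Q, dim); feat_map: (B, C, H, W)
        B, Q, _ = queries.shape
        grid = self.to_points(queries).tanh().view(B, Q, -1, 2)  # in [-1, 1]
        sampled = F.grid_sample(feat_map, grid, align_corners=False)
        return sampled.permute(0, 2, 3, 1)  # (B, Q, n_points, C)
```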
- Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning a graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance.
arXiv Detail & Related papers (2021-12-02T18:29:07Z)
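A simplified sketch of the graph construction SPELL's summary describes: one node per person per frame, temporal edges linking the same person across neighboring frames, and spatial edges linking people who co-occur in a frame. The exact edge rules here are assumptions.

```python
import itertools

def build_graph(tracks):
    """tracks: list of (frame_idx, person_id) pairs, one node per face.
    Returns edges connecting (a) the same person across adjacent frames and
    (b) different people within the same frame."""
    edges = []
    for i, j in itertools.combinations(range(len(tracks)), 2):
        (f1, p1), (f2, p2) = tracks[i], tracks[j]
        if p1 == p2 and abs(f1 - f2) <= 1:  # temporal edge (same person)
            edges.append((i, j))
        elif f1 == f2:                      # spatial edge (same frame)
            edges.append((i, j))
    return edges
```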
- Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification [4.191965713559235]
This paper presents a multi-modal model for automatic scene classification.
It simultaneously exploits auditory and visual information.
It has been shown to provide an excellent trade-off between prediction performance and system complexity.
arXiv Detail & Related papers (2021-07-28T06:10:10Z)
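The squeeze-excitation mechanism in such models is typically the standard block below: global average pooling ("squeeze") followed by a small bottleneck that produces channel weights ("excitation"), then a channel-wise rescale. The reduction ratio is an assumption.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation block."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (B, C, H, W)
        w = self.fc(self.pool(x).flatten(1))  # (B, C) channel weights
        return x * w[:, :, None, None]        # channel-wise rescale
```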
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
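A hybrid CTC/Attention model is conventionally trained with a weighted sum of the two losses. A minimal sketch, assuming padded targets shared by both branches (real systems shift the attention targets with sos/eos tokens and mask padding) and an illustrative weight:

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_logits, att_logits, targets,
                              input_lens, target_lens, alpha=0.3):
    # ctc_logits: (B, T, C) encoder outputs; att_logits: (B, L, C) decoder
    # outputs; targets: (B, L) padded label ids; alpha is an assumed weight.
    ctc = F.ctc_loss(ctc_logits.log_softmax(-1).transpose(0, 1),
                     targets, input_lens, target_lens)
    ce = F.cross_entropy(att_logits.flatten(0, 1), targets.flatten())
    return alpha * ctc + (1 - alpha) * ce
```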
- MAAS: Multi-modal Assignation for Active Speaker Detection [59.08836580733918]
We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem.
Our experiments show that a small graph data structure built from a single frame allows approximating an instantaneous audio-visual assignment problem.
arXiv Detail & Related papers (2021-01-11T02:57:25Z)
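A toy stand-in for the instantaneous, single-frame audio-visual assignment mentioned above: connect one audio embedding to every detected face and pick the best-matching face. The dot-product scoring is an assumption.

```python
def assign_speaker(audio_emb, face_embs):
    """audio_emb: (D,) tensor; face_embs: (K, D) for K faces in one frame.
    Scores each audio-to-face edge and returns a soft assignment."""
    scores = face_embs @ audio_emb   # edge weights, audio -> faces
    probs = scores.softmax(dim=0)    # soft assignment over faces
    return probs.argmax().item(), probs
```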
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker at each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
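TS-VAD conditions on a fixed set of target-speaker embeddings and predicts each speaker's activity at every frame. A minimal sketch, with dimensions and the simple concatenation design assumed for illustration:

```python
import torch
import torch.nn as nn

class TSVADHead(nn.Module):
    """Sketch of a TS-VAD-style output: given frame-level acoustic features
    and target-speaker embeddings, predict per-speaker, per-frame activity."""
    def __init__(self, feat_dim=256, spk_dim=100, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim + spk_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, feats, spk_embs):
        # feats: (B, T, feat_dim); spk_embs: (B, S, spk_dim), S speakers.
        B, T, _ = feats.shape
        S = spk_embs.shape[1]
        # Pair every frame with every target-speaker embedding.
        x = torch.cat([feats.unsqueeze(1).expand(-1, S, -1, -1),
                       spk_embs.unsqueeze(2).expand(-1, -1, T, -1)], dim=-1)
        h, _ = self.rnn(x.flatten(0, 1))            # (B*S, T, 2*hidden)
        return self.out(h).sigmoid().view(B, S, T)  # per-speaker activity
```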