Audio-video fusion strategies for active speaker detection in meetings
- URL: http://arxiv.org/abs/2206.10411v1
- Date: Thu, 9 Jun 2022 08:20:52 GMT
- Title: Audio-video fusion strategies for active speaker detection in meetings
- Authors: Lionel Pibre, Francisco Madrigal, Cyrille Equoy, Frédéric Lerasle,
Thomas Pellegrini, Julien Pinquier, Isabelle Ferrané
- Abstract summary: We propose two types of fusion for the detection of the active speaker, combining two visual modalities and an audio modality through neural networks.
For our application context, adding motion information greatly improves performance.
We have shown that attention-based fusion improves performance while reducing the standard deviation.
- Score: 5.61861182374067
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Meetings are a common activity in professional contexts, and it remains
challenging to endow vocal assistants with advanced functionalities to
facilitate meeting management. In this context, a task like active speaker
detection can provide useful insights to model interaction between meeting
participants. Motivated by our application context, an advanced meeting
assistant, we combine audio and visual information to achieve the best
possible performance. In this paper, we propose two different types of fusion
for the detection of the active speaker, combining two visual modalities and an
audio modality through neural networks. For comparison purposes, classical
unsupervised approaches for audio feature extraction are also used. We expect
visual data centered on the face of each participant to be very appropriate for
detecting voice activity, based on the detection of lip and facial gestures.
Thus, our baseline system uses visual data and we chose a 3D Convolutional
Neural Network architecture, which is effective for simultaneously encoding
appearance and movement. To improve this system, we supplemented the visual
information by processing the audio stream with a CNN or an unsupervised
speaker diarization system. We have further improved this system by adding a
second visual modality, motion, captured through optical flow. We evaluated our
proposal with a public and state-of-the-art benchmark: the AMI corpus. We
analysed the contribution of each system to the fusion in order to determine
whether a given participant is currently speaking, and we discuss the results
obtained. Moreover, we have shown that, for our application context,
adding motion information greatly improves performance. Finally, we have shown
that attention-based fusion improves performance while reducing the standard
deviation.
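As a concrete illustration of the kind of attention-based fusion described in the abstract, the sketch below weights per-modality embeddings (face appearance, optical-flow motion, audio) with softmax attention before a speaking/not-speaking classification. It is a minimal PyTorch sketch under assumed embedding sizes and layer choices; the class name `AttentionFusion` and all dimensions are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of attention-based fusion of three modality embeddings
# (appearance, optical-flow motion, audio) for per-participant speaking /
# not-speaking classification. Dimensions and layers are illustrative.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # One scalar attention score per modality embedding.
        self.score = nn.Linear(dim, 1)
        self.classifier = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, n_modalities, dim)
        scores = self.score(embeddings)            # (batch, n_modalities, 1)
        weights = torch.softmax(scores, dim=1)     # attention over modalities
        fused = (weights * embeddings).sum(dim=1)  # (batch, dim)
        return self.classifier(fused)              # speaking / not-speaking logit


if __name__ == "__main__":
    batch = 8
    appearance = torch.randn(batch, 256)  # e.g. 3D-CNN face embedding
    motion = torch.randn(batch, 256)      # e.g. optical-flow 3D-CNN embedding
    audio = torch.randn(batch, 256)       # e.g. audio-CNN embedding
    fusion = AttentionFusion(dim=256)
    logits = fusion(torch.stack([appearance, motion, audio], dim=1))
    print(logits.shape)  # torch.Size([8, 1])
```

Compared with plain concatenation, such per-sample weights let the network down-weight an unreliable modality; this is one plausible reading of why attention-based fusion reduces the standard deviation reported in the abstract.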
Related papers
- Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense
Interactions through Masked Modeling [24.346868432774453]
Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment.
This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models.
We address training early fusion architectures by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion.
We propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions.
arXiv Detail & Related papers (2023-12-02T03:38:49Z) - Cooperative Dual Attention for Audio-Visual Speech Enhancement with
Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z) - CM-PIE: Cross-modal perception for interactive-enhanced audio-visual
video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which can learn fine-grained features by applying a segment-based attention module.
We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z) - Audio-Visual Speaker Verification via Joint Cross-Attention [4.229744884478575]
We propose cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification.
We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification.
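As a rough sketch of what joint cross-attention between modalities can look like (shapes, pooling, and module names here are assumptions for illustration, not the architecture of the paper above), each modality can attend to the other before the attended representations are combined:

```python
# Illustrative joint cross-attention between audio and visual feature sequences.
# Dimensions, the mean pooling, and the final projection are assumptions.
import torch
import torch.nn as nn


class JointCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_audio, dim), video: (batch, T_video, dim)
        a_att, _ = self.a2v(audio, video, video)  # audio queries attend to video
        v_att, _ = self.v2a(video, audio, audio)  # video queries attend to audio
        joint = torch.cat([a_att.mean(dim=1), v_att.mean(dim=1)], dim=-1)
        return self.proj(joint)                   # joint audio-visual embedding


if __name__ == "__main__":
    audio = torch.randn(4, 100, 128)  # e.g. 100 audio frames
    video = torch.randn(4, 25, 128)   # e.g. 25 video frames
    print(JointCrossAttention()(audio, video).shape)  # torch.Size([4, 128])
```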
arXiv Detail & Related papers (2023-09-28T16:25:29Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - Look, Listen, and Attend: Co-Attention Network for Self-Supervised
Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervisory information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
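Dense optical flow of the kind used as a motion cue, both in this entry and in the main paper above, can be estimated with classical methods; the OpenCV sketch below uses the Farneback algorithm, with a placeholder video path and default-style parameters (all names here are illustrative, not taken from either paper).

```python
# Minimal dense optical-flow extraction with OpenCV (Farneback method).
# The video path is a placeholder; parameter values are commonly used defaults.
import cv2
import numpy as np

cap = cv2.VideoCapture("meeting_face_crop.mp4")  # hypothetical face-crop clip
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not read the input video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense flow: one (dx, dy) vector per pixel between consecutive frames.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0
    )
    flows.append(flow)
    prev_gray = gray

cap.release()
motion_stack = np.stack(flows)
print(motion_stack.shape)  # (num_frames - 1, height, width, 2)
```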
arXiv Detail & Related papers (2020-08-10T16:18:01Z) - Cross modal video representations for weakly supervised active speaker
localization [39.67239953795999]
A cross-modal neural network for learning visual representations is presented.
We present a weakly supervised system for the task of localizing active speakers in movie content.
We also demonstrate state-of-the-art performance for the task of voice activity detection in an audio-visual framework.
arXiv Detail & Related papers (2020-03-09T18:50:50Z) - Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.