Joint Learning of Visual-Audio Saliency Prediction and Sound Source
Localization on Multi-face Videos
- URL: http://arxiv.org/abs/2111.08567v1
- Date: Fri, 5 Nov 2021 14:35:08 GMT
- Title: Joint Learning of Visual-Audio Saliency Prediction and Sound Source
Localization on Multi-face Videos
- Authors: Minglang Qiao, Yufan Liu, Mai Xu, Xin Deng, Bing Li, Weiming Hu, Ali
Borji
- Abstract summary: We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
- Score: 101.83513408195692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual and audio events occur simultaneously, and both attract attention.
However, most existing saliency prediction works ignore the influence of audio
and only consider vision modality. In this paper, we propose a multitask
learning method for visual-audio saliency prediction and sound source
localization on multi-face video by leveraging visual, audio and face
information. Specifically, we first introduce a large-scale database of
multi-face video in visual-audio condition (MVVA), containing eye-tracking data
and sound source annotations. Using this database, we find that sound
influences human attention and, conversely, attention offers a cue for
determining the sound source in multi-face videos. Guided by these findings, a
visual-audio
multi-task network (VAM-Net) is introduced to predict saliency and locate sound
source. VAM-Net consists of three branches corresponding to visual, audio and
face modalities. The visual branch has a two-stream architecture to capture
spatial and temporal information, while the face and audio branches encode
faces and audio signals, respectively. Finally, a spatio-temporal multi-modal
graph (STMG) is constructed to model the interaction among multiple faces. With
joint optimization of these branches, the intrinsic correlation between the
saliency prediction and sound source localization tasks is exploited, so that
each task boosts the other's performance. Experiments show that the proposed method
outperforms 12 state-of-the-art saliency prediction methods, and achieves
competitive results in sound source localization.
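
The abstract above describes VAM-Net only in prose. As a rough illustration of
the joint-learning idea, below is a minimal PyTorch-style sketch of a
three-branch network (two-stream visual, audio, face) trained with a combined
saliency-prediction and sound-source-localization loss. All class names,
feature dimensions, and loss weights are illustrative assumptions rather than
the authors' implementation, and the sketch omits the spatio-temporal
multi-modal graph (STMG) that models interactions among faces.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualTwoStream(nn.Module):
    # Toy two-stream visual branch: a spatial stream over the RGB frame and a
    # temporal stream over the frame difference (a stand-in for motion input).
    def __init__(self, dim=128):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU())
        self.temporal = nn.Sequential(nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, frame, frame_diff):
        # frame, frame_diff: (B, 3, H, W) -> fused feature map (B, dim, H/2, W/2)
        return self.spatial(frame) + self.temporal(frame_diff)


class JointSaliencySoundNet(nn.Module):
    # Predicts a saliency map and a per-face sound-source score in one forward pass.
    def __init__(self, dim=128, audio_dim=64, face_dim=64):
        super().__init__()
        self.visual = VisualTwoStream(dim)
        self.audio = nn.Sequential(nn.Linear(128, audio_dim), nn.ReLU())  # e.g. pooled log-mel features
        self.face = nn.Sequential(nn.Linear(512, face_dim), nn.ReLU())    # e.g. per-face embeddings
        self.saliency_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.source_head = nn.Linear(face_dim + audio_dim, 1)

    def forward(self, frame, frame_diff, audio_feat, face_feats):
        v = self.visual(frame, frame_diff)                 # (B, dim, h, w)
        a = self.audio(audio_feat)                         # (B, audio_dim)
        f = self.face(face_feats)                          # (B, N_faces, face_dim)
        saliency = torch.sigmoid(self.saliency_head(v))    # (B, 1, h, w)
        a_rep = a.unsqueeze(1).expand(-1, f.size(1), -1)   # broadcast audio feature to every face
        source_logits = self.source_head(torch.cat([f, a_rep], dim=-1)).squeeze(-1)  # (B, N_faces)
        return saliency, source_logits


def joint_loss(saliency, sal_gt, source_logits, source_gt, alpha=1.0, beta=1.0):
    # Weighted sum of the two task losses; the alpha/beta weights are assumptions.
    sal_loss = F.binary_cross_entropy(saliency, sal_gt)    # sal_gt: (B, 1, h, w) fixation map in [0, 1]
    src_loss = F.cross_entropy(source_logits, source_gt)   # source_gt: (B,) index of the sounding face
    return alpha * sal_loss + beta * src_loss

In the paper, the per-face features additionally interact through the STMG
before prediction; the shared branches and weighted joint loss above simply
illustrate how the two tasks can be optimized together so that each can inform
the other.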
Related papers
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
arXiv Detail & Related papers (2023-03-30T16:01:50Z)
- TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation [35.93516937521393]
We introduce TriBERT -- a transformer-based architecture inspired by ViLBERT.
TriBERT enables contextual feature learning across three modalities: vision, pose, and audio.
We show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks.
arXiv Detail & Related papers (2021-10-26T04:50:42Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction [15.679379904130908]
Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
arXiv Detail & Related papers (2021-09-17T06:49:43Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)