Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction
- URL: http://arxiv.org/abs/2109.08371v1
- Date: Fri, 17 Sep 2021 06:49:43 GMT
- Title: Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction
- Authors: Yuan Yuan, Hailong Ning, and Bin Zhao
- Abstract summary: Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
- Score: 15.679379904130908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Attention Prediction (VAP) methods simulate the human
selective attention mechanism to perceive the scene, which is significant and
imperative in many vision tasks. Most existing methods only consider visual
cues, while neglecting the accompanying audio information, which can provide
complementary information for scene understanding. In fact, there exists a
strong relation between auditory and visual cues, and humans generally perceive
the surrounding scene by sensing these cues simultaneously. Motivated by this,
a bio-inspired audio-visual cues integration method is proposed for the VAP
task, which explores the audio modality to better predict the visual attention
map by assisting the vision modality. The proposed method consists of three
parts: 1) audio-visual encoding, 2) audio-visual location, and 3) multi-cues
aggregation. Firstly, a refined SoundNet architecture is adopted to encode the
audio modality and obtain the corresponding features, while a modified 3D
ResNet-50 architecture is employed to learn visual features containing both
spatial location and temporal motion information. Secondly, an audio-visual
location part is devised to locate the sound source in the visual scene by
learning the correspondence between the audio and visual information. Thirdly,
a multi-cues aggregation part is devised to adaptively aggregate the
audio-visual information and a center-bias prior to generate the final visual
attention map. Extensive experiments are conducted on six challenging
audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2,
SumMe, and ETMD, and the results show significant superiority over
state-of-the-art visual attention models.
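For intuition, the sketch below illustrates the three-part pipeline described in the abstract as a minimal PyTorch program. The small 1D/3D convolutional encoders merely stand in for the refined SoundNet and the modified 3D ResNet-50, and the cosine-similarity localization, the softmax cue weighting, the Gaussian center-bias prior, the channel sizes, and all module names are illustrative assumptions rather than the paper's exact design; training losses and data loading are omitted.

```python
# Minimal sketch of the three-part pipeline (assumed simplifications throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Stand-in for the refined SoundNet: 1D convolutions over a raw waveform."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=32, stride=8), nn.ReLU(),
            nn.Conv1d(64, out_dim, kernel_size=16, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # -> (B, out_dim, 1)
        )

    def forward(self, wave):                              # wave: (B, 1, T)
        return self.net(wave).squeeze(-1)                 # (B, out_dim)


class VisualEncoder(nn.Module):
    """Stand-in for the modified 3D ResNet-50: spatio-temporal 3D convolutions."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(64, 128, 3, stride=(2, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(128, out_dim, 3, stride=(2, 2, 2), padding=1), nn.ReLU(),
        )

    def forward(self, clip):                              # clip: (B, 3, T, H, W)
        feat = self.net(clip)                              # (B, C, T', H', W')
        return feat.mean(dim=2)                            # pool time -> (B, C, H', W')


class AudioVisualLocation(nn.Module):
    """Locate the sound source via cosine similarity between the audio embedding
    and the visual feature at every spatial position."""
    def forward(self, a, v):                               # a: (B, C), v: (B, C, H, W)
        a = F.normalize(a, dim=1)[:, :, None, None]
        v = F.normalize(v, dim=1)
        return (a * v).sum(dim=1, keepdim=True)            # (B, 1, H, W)


class MultiCueAggregation(nn.Module):
    """Adaptively weight the visual cue, the audio-guided cue, and a fixed
    Gaussian center-bias prior, then produce the attention map."""
    def __init__(self, in_ch=256):
        super().__init__()
        self.visual_head = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.weights = nn.Parameter(torch.ones(3))          # one learnable weight per cue

    def forward(self, v, av_map, center_bias):
        cues = torch.cat([self.visual_head(v), av_map, center_bias], dim=1)
        w = torch.softmax(self.weights, dim=0).view(1, 3, 1, 1)
        return torch.sigmoid((cues * w).sum(dim=1, keepdim=True))


def gaussian_center_bias(h, w, sigma=0.25):
    """Fixed center-bias prior: a 2D Gaussian centred on the frame."""
    ys = torch.linspace(-1, 1, h).view(h, 1).expand(h, w)
    xs = torch.linspace(-1, 1, w).view(1, w).expand(h, w)
    return torch.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2)).view(1, 1, h, w)


if __name__ == "__main__":
    audio_enc, visual_enc = AudioEncoder(), VisualEncoder()
    locate, aggregate = AudioVisualLocation(), MultiCueAggregation()
    wave = torch.randn(2, 1, 16000)                         # 1 s of 16 kHz audio
    clip = torch.randn(2, 3, 8, 112, 112)                   # 8-frame RGB clip
    a, v = audio_enc(wave), visual_enc(clip)
    av_map = locate(a, v)                                   # audio-visual location cue
    bias = gaussian_center_bias(*v.shape[-2:]).expand(v.shape[0], -1, -1, -1)
    attention = aggregate(v, av_map, bias)                  # (B, 1, H', W') in [0, 1]
    print(attention.shape)
```

In this sketch the three cues are fused by a single softmax-normalized weight vector; the paper's aggregation is learned and adaptive, so this should be read only as a structural outline of encoding, localization, and multi-cue fusion.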
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Estimating Visual Information From Audio Through Manifold Learning [14.113590443352495]
We propose a new framework for extracting visual information about a scene only using audio signals.
Our framework is based on Manifold Learning and consists of two steps.
We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset.
arXiv Detail & Related papers (2022-08-03T20:47:11Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
- A proto-object based audiovisual saliency map [0.0]
We develop a proto-object based audiovisual saliency map (AVSM) for analysis of dynamic natural scenes.
Such a saliency map can be useful in surveillance, robotic navigation, video compression, and related applications.
arXiv Detail & Related papers (2020-03-15T08:34:35Z)