How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model
- URL: http://arxiv.org/abs/2408.05411v1
- Date: Sat, 10 Aug 2024 02:45:46 GMT
- Title: How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model
- Authors: Yuxin Zhu, Huiyu Duan, Kaiwei Zhang, Yucheng Zhu, Xilei Zhu, Long Teng, Xiongkuo Min, Guangtao Zhai
- Abstract summary: This paper comprehensively investigates audio-visual attention in omnidirectional videos (ODVs) from both subjective and objective perspectives.
To advance the research on audio-visual saliency prediction for ODVs, we establish a new benchmark based on the AVS-ODV database.
- Score: 50.15552768350462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding and predicting viewer attention in omnidirectional videos (ODVs) is crucial for enhancing user engagement in virtual and augmented reality applications. Although both audio and visual modalities are essential for saliency prediction in ODVs, the joint exploitation of these two modalities has been limited, primarily due to the absence of large-scale audio-visual saliency databases and comprehensive analyses. This paper comprehensively investigates audio-visual attention in ODVs from both subjective and objective perspectives. Specifically, we first introduce a new audio-visual saliency database for omnidirectional videos, termed the AVS-ODV database, containing 162 ODVs and corresponding eye movement data collected from 60 subjects under three audio modes including mute, mono, and ambisonics. Based on the constructed AVS-ODV database, we perform an in-depth analysis of how audio influences visual attention in ODVs. To advance the research on audio-visual saliency prediction for ODVs, we further establish a new benchmark based on the AVS-ODV database by testing numerous state-of-the-art saliency models, including visual-only models and audio-visual models. In addition, given the limitations of current models, we propose an innovative omnidirectional audio-visual saliency prediction network (OmniAVS), which is built upon the U-Net architecture and hierarchically fuses audio and visual features from the multimodal aligned embedding space. Extensive experimental results demonstrate that the proposed OmniAVS model outperforms other state-of-the-art models on both ODV AVS prediction and traditional AVS prediction tasks. The AVS-ODV database and OmniAVS model will be released to facilitate future research.
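To make the fusion idea concrete, below is a minimal sketch, assuming PyTorch, of a U-Net-style saliency network that hierarchically injects a global audio embedding into the visual features at every encoder scale, in the spirit of the OmniAVS description above. The module names, channel sizes, and fusion scheme are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (illustrative assumptions, not the authors' OmniAVS code):
# a U-Net-style saliency network that fuses a global audio embedding into
# the visual features at every encoder scale.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FuseBlock(nn.Module):
    """Injects a global audio embedding into a visual feature map."""

    def __init__(self, vis_ch: int, aud_dim: int):
        super().__init__()
        self.proj = nn.Linear(aud_dim, vis_ch)            # audio -> visual channel space
        self.mix = nn.Conv2d(2 * vis_ch, vis_ch, kernel_size=1)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # Broadcast the projected audio vector over the spatial grid, then mix.
        a = self.proj(aud)[:, :, None, None].expand_as(vis)
        return self.mix(torch.cat([vis, a], dim=1))


class AVSaliencyUNet(nn.Module):
    def __init__(self, aud_dim: int = 128, chs=(32, 64, 128)):
        super().__init__()
        self.encs, in_ch = nn.ModuleList(), 3
        for ch in chs:
            self.encs.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU()))
            in_ch = ch
        # One fusion block per encoder scale: the "hierarchical" fusion.
        self.fuses = nn.ModuleList(FuseBlock(ch, aud_dim) for ch in chs)
        self.decs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(hi + lo, lo, 3, padding=1), nn.ReLU())
            for hi, lo in zip(chs[::-1], chs[::-1][1:]))
        self.head = nn.Conv2d(chs[0], 1, 1)               # per-pixel saliency logits

    def forward(self, frame: torch.Tensor, aud_emb: torch.Tensor) -> torch.Tensor:
        skips, x = [], frame
        for i, (enc, fuse) in enumerate(zip(self.encs, self.fuses)):
            if i > 0:
                x = F.max_pool2d(x, 2)                    # downsample between scales
            x = fuse(enc(x), aud_emb)                     # fuse audio at every scale
            skips.append(x)
        x = skips.pop()                                   # deepest scale = bottleneck
        for dec in self.decs:
            skip = skips.pop()
            x = F.interpolate(x, size=skip.shape[-2:],
                              mode="bilinear", align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))
        return torch.sigmoid(self.head(x))                # saliency map in [0, 1]


if __name__ == "__main__":
    model = AVSaliencyUNet()
    frame = torch.randn(2, 3, 128, 256)   # e.g. an equirectangular ODV frame
    aud = torch.randn(2, 128)             # e.g. a pooled ambisonics embedding
    print(model(frame, aud).shape)        # torch.Size([2, 1, 128, 256])
```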
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a Mixture-of-Experts for audio-visual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a visual token sequence and map it into the speech space via a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z) - Unveiling Visual Biases in Audio-Visual Localization Benchmarks [52.76903182540441]
We identify a significant issue in existing benchmarks: the sounding objects are often easily recognized from visual cues alone, which we refer to as visual bias.
Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
arXiv Detail & Related papers (2024-08-25T04:56:08Z) - Audio-visual Saliency for Omnidirectional Videos [58.086575606742116]
We first establish the largest audio-visual saliency dataset for omnidirectional videos (AVS-ODV).
We analyze the visual attention behavior of the observers under various omnidirectional audio modalities and visual scenes based on the AVS-ODV dataset.
arXiv Detail & Related papers (2023-11-09T08:03:40Z) - Perceptual Quality Assessment of Omnidirectional Audio-visual Signals [37.73157112698111]
Most existing quality assessment studies for omnidirectional videos (ODVs) only focus on the visual distortions of videos.
In this paper, we first establish a large-scale audio-visual quality assessment dataset for ODVs.
Then, we design three baseline methods for full-reference omnidirectional audio-visual quality assessment (OAVQA).
arXiv Detail & Related papers (2023-07-20T12:21:26Z) - Audio-visual speech enhancement with a deep Kalman filter generative
model [0.0]
We present an audio-visual deep Kalman filter (AV-DKF) generative model, which assumes a first-order Markov chain model for the latent variables.
We develop an efficient inference methodology to estimate speech signals at test time.
arXiv Detail & Related papers (2022-11-02T09:50:08Z) - A Comprehensive Survey on Video Saliency Detection with Auditory
Information: the Audio-visual Consistency Perceptual is the Key! [25.436683033432086]
Video saliency detection (VSD) aims at quickly locating the most attractive objects/things/patterns in a given video clip.
This paper provides an extensive review to bridge the gap between audio-visual fusion and saliency detection.
arXiv Detail & Related papers (2022-06-20T07:25:13Z) - A study on joint modeling and data augmentation of multi-modalities for
audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system achieves the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z) - STAViS: Spatio-Temporal AudioVisual Saliency Network [45.04894808904767]
STAViS is a network that combines visual saliency and auditory features.
It learns to localize sound sources appropriately and to fuse the two saliency maps into a final saliency map (a minimal late-fusion sketch follows this list).
We compare our method against 8 different state-of-the-art visual saliency models.
arXiv Detail & Related papers (2020-01-09T15:34:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.