Audio-visual Saliency for Omnidirectional Videos
- URL: http://arxiv.org/abs/2311.05190v1
- Date: Thu, 9 Nov 2023 08:03:40 GMT
- Title: Audio-visual Saliency for Omnidirectional Videos
- Authors: Yuxin Zhu, Xilei Zhu, Huiyu Duan, Jie Li, Kaiwei Zhang, Yucheng Zhu,
Li Chen, Xiongkuo Min, Guangtao Zhai
- Abstract summary: We first establish the largest audio-visual saliency dataset for omnidirectional videos (AVS-ODV).
We analyze the visual attention behavior of the observers under various omnidirectional audio modalities and visual scenes based on the AVS-ODV dataset.
- Score: 58.086575606742116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual saliency prediction for omnidirectional videos (ODVs) is of
great significance for ODV coding, ODV transmission, ODV rendering, etc.
However, most studies consider only visual information for ODV saliency
prediction, while audio is rarely considered despite its significant influence
on viewing behavior. This is mainly due to the lack of large-scale audio-visual
ODV datasets and corresponding analysis. Thus, in this paper, we first
establish the largest audio-visual saliency dataset for omnidirectional videos
(AVS-ODV), which comprises omnidirectional videos, their audio tracks, and the
corresponding eye-tracking data captured under three audio modalities: mute,
mono, and ambisonics. We then analyze the visual attention behavior of
observers under the different audio modalities and visual scenes based on the
AVS-ODV dataset. Furthermore, we compare the performance of several
state-of-the-art saliency prediction models on the AVS-ODV dataset and
construct a new benchmark. Our AVS-ODV dataset and the benchmark will be
released to facilitate future research.
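As a rough illustration of how such a benchmark comparison is commonly carried out (this is not the authors' released evaluation code), the Python sketch below scores a predicted saliency map against eye-tracking ground truth using two widely used metrics, the linear correlation coefficient (CC) and the normalized scanpath saliency (NSS); the array names, shapes, and random inputs are hypothetical.

    import numpy as np

    def cc(pred, gt_map):
        # Linear correlation coefficient between predicted and ground-truth saliency maps.
        p = (pred - pred.mean()) / (pred.std() + 1e-8)
        g = (gt_map - gt_map.mean()) / (gt_map.std() + 1e-8)
        return float(np.corrcoef(p.ravel(), g.ravel())[0, 1])

    def nss(pred, fixations):
        # Normalized scanpath saliency: mean normalized prediction at fixated pixels.
        p = (pred - pred.mean()) / (pred.std() + 1e-8)
        return float(p[fixations > 0].mean())

    # Hypothetical example: one equirectangular frame (H x W), a binary fixation
    # map, and a blurred fixation density map as the ground-truth saliency map.
    H, W = 256, 512
    pred = np.random.rand(H, W)        # model's predicted saliency map
    gt_map = np.random.rand(H, W)      # e.g., Gaussian-blurred fixation density
    fixations = np.zeros((H, W))
    fixations[128, 256] = 1            # one example fixation location
    print("CC:", cc(pred, gt_map), "NSS:", nss(pred, fixations))

Note that the equirectangular projection of ODVs over-represents regions near the poles, so benchmark evaluations may apply spherical weighting or resampling; the plain metrics above omit this for brevity.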
Related papers
- Temporally Aligned Audio for Video with Autoregression [17.019400481122872]
V-AURA is the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation.
VisualSound is a benchmark dataset with high audio-visual relevance.
arXiv Detail & Related papers (2024-09-20T17:59:01Z)
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts for audio-visual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Unveiling Visual Biases in Audio-Visual Localization Benchmarks [52.76903182540441]
We identify a significant issue in existing benchmarks.
The sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias.
Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
arXiv Detail & Related papers (2024-08-25T04:56:08Z)
- How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model [50.15552768350462]
This paper comprehensively investigates audio-visual attention in omnidirectional videos (ODVs) from both subjective and objective perspectives.
To advance the research on audio-visual saliency prediction for ODVs, we establish a new benchmark based on the AVS-ODV database.
arXiv Detail & Related papers (2024-08-10T02:45:46Z)
- Perceptual Quality Assessment of Omnidirectional Audio-visual Signals [37.73157112698111]
Most existing quality assessment studies for omnidirectional videos (ODVs) only focus on the visual distortions of videos.
In this paper, we first establish a large-scale audio-visual quality assessment dataset for ODVs.
Then, we design three baseline methods for full-reference omnidirectional audio-visual quality assessment (OAVQA).
arXiv Detail & Related papers (2023-07-20T12:21:26Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
- A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key! [25.436683033432086]
Video saliency detection (VSD) aims to quickly locate the most attractive objects/things/patterns in a given video clip.
This paper provides an extensive review to bridge the gap between audio-visual fusion and saliency detection.
arXiv Detail & Related papers (2022-06-20T07:25:13Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.