Audio-visual Saliency for Omnidirectional Videos
- URL: http://arxiv.org/abs/2311.05190v1
- Date: Thu, 9 Nov 2023 08:03:40 GMT
- Title: Audio-visual Saliency for Omnidirectional Videos
- Authors: Yuxin Zhu, Xilei Zhu, Huiyu Duan, Jie Li, Kaiwei Zhang, Yucheng Zhu,
Li Chen, Xiongkuo Min, Guangtao Zhai
- Abstract summary: We first establish the largest audio-visual saliency dataset for omnidirectional videos (AVS-ODV).
We analyze the visual attention behavior of the observers under various omnidirectional audio modalities and visual scenes based on the AVS-ODV dataset.
- Score: 58.086575606742116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual saliency prediction for omnidirectional videos (ODVs) is of
great significance for ODV coding, ODV transmission, ODV rendering, etc.
However, most studies consider only visual information for ODV saliency
prediction, while audio is rarely considered despite its significant influence
on viewing behavior for ODVs. This is mainly due to the lack of large-scale
audio-visual ODV datasets and corresponding analyses. Thus, in this paper, we
first establish the largest audio-visual saliency dataset for omnidirectional
videos (AVS-ODV), which comprises omnidirectional videos, audio, and the
corresponding captured eye-tracking data for three video sound modalities:
mute, mono, and ambisonics. We then analyze the visual attention behavior of
observers under various omnidirectional audio modalities and visual scenes
based on the AVS-ODV dataset. Furthermore, we compare the performance of
several state-of-the-art saliency prediction models on the AVS-ODV dataset and
construct a new benchmark. Our AVS-ODV dataset and benchmark will be released
to facilitate future research.
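As a hedged illustration of how such a saliency benchmark is typically scored, the sketch below computes two standard eye-tracking metrics, NSS and CC. The array names, shapes, and thresholds are assumptions for illustration, not the AVS-ODV data format; ODV evaluation in practice often also applies latitude weighting to compensate for equirectangular distortion.

```python
# Minimal sketch of two common saliency metrics (NSS and CC); shapes and
# variable names are illustrative assumptions, not the AVS-ODV format.
import numpy as np

def nss(saliency: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    z = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(z[fixations > 0].mean())

def cc(saliency: np.ndarray, density: np.ndarray) -> float:
    """Pearson linear correlation between predicted and ground-truth maps."""
    a = saliency - saliency.mean()
    b = density - density.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-8))

# Illustrative usage on random stand-ins for one equirectangular frame.
pred = np.random.rand(256, 512)                       # predicted saliency map
fix = (np.random.rand(256, 512) > 0.999).astype(int)  # binary fixation map
dens = np.random.rand(256, 512)                       # fixation density map
print(nss(pred, fix), cc(pred, dens))
```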
Related papers
- Audio-visual training for improved grounding in video-text LLMs [1.9320359360360702]
We propose a model architecture that handles audio-visual inputs explicitly.
We train our model with both audio and visual data from a video instruction-tuning dataset.
For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset.
arXiv Detail & Related papers (2024-07-21T03:59:14Z)
- Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation [17.123212921673176]
We propose a Spatio-Temporal Bidirectional Audio-Visual Attention (ST-BAVA) module integrated between SAM's image encoder and mask decoder.
It adaptively updates the audio-visual features to convey the temporal correspondence between the video frames and audio streams.
Our proposed model outperforms the state-of-the-art methods on AVS benchmarks, especially with an 8.3% mIoU gain on a challenging multi-source subset.
arXiv Detail & Related papers (2024-06-10T10:53:23Z)
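As a hedged illustration of the bidirectional audio-visual interaction this entry describes, the sketch below implements a generic cross-attention block in which visual tokens attend to audio tokens and vice versa. It is not the paper's ST-BAVA module; all dimensions and token counts are illustrative assumptions.

```python
# Generic bidirectional audio-visual cross-attention; NOT the paper's
# ST-BAVA module, just an illustration of the interaction pattern.
import torch
import torch.nn as nn

class BidirectionalAVAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # One attention for each direction of the audio-visual exchange.
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis: (B, Nv, dim) visual tokens; aud: (B, Na, dim) audio tokens.
        v2, _ = self.v_from_a(vis, aud, aud)  # visual queries, audio keys/values
        a2, _ = self.a_from_v(aud, vis, vis)  # audio queries, visual keys/values
        return vis + v2, aud + a2             # residual updates of both streams

block = BidirectionalAVAttention()
v = torch.randn(2, 196, 256)  # e.g. 14x14 patch tokens per frame
a = torch.randn(2, 10, 256)   # e.g. 10 audio tokens per clip
v_out, a_out = block(v, a)
print(v_out.shape, a_out.shape)
```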
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance for BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Perceptual Quality Assessment of Omnidirectional Audio-visual Signals [37.73157112698111]
Most existing quality assessment studies for omnidirectional videos (ODVs) only focus on the visual distortions of videos.
In this paper, we first establish a large-scale audio-visual quality assessment dataset for ODVs.
Then, we design three baseline methods for full-reference omnidirectional audio-visual quality assessment (OAVQA).
arXiv Detail & Related papers (2023-07-20T12:21:26Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimensions inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
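As a hedged illustration of the contrastive objective family this entry builds on, the sketch below implements a symmetric InfoNCE loss over paired audio and video clip embeddings. The temperature and embedding sizes are illustrative assumptions, not the paper's configuration.

```python
# Minimal symmetric InfoNCE loss for paired audio-video embeddings; the
# temperature and sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

def av_infonce(video_emb: torch.Tensor, audio_emb: torch.Tensor,
               temperature: float = 0.07) -> torch.Tensor:
    # Embeddings: (B, D); row i of each tensor comes from the same clip.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(v.size(0))     # matching pairs lie on the diagonal
    # Symmetric loss: video-to-audio and audio-to-video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = av_infonce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```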
- A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key! [25.436683033432086]
Video saliency detection (VSD) aims at quickly locating the most attractive objects/things/patterns in a given video clip.
This paper provides an extensive review to bridge the gap between audio-visual fusion and saliency detection.
arXiv Detail & Related papers (2022-06-20T07:25:13Z)
- AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)
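As a hedged illustration of the spectrogram front end that AV-ASR models of this kind consume, the sketch below extracts log-mel features with torchaudio. The parameter values are common defaults, not necessarily AVATAR's exact configuration.

```python
# Minimal log-mel spectrogram front end; parameter values are common
# defaults and only an assumption about the paper's setup.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)

waveform = torch.randn(1, 16000)        # stand-in for 1 s of 16 kHz audio
spec = torch.log(mel(waveform) + 1e-6)  # (1, 80, frames) log-mel features
print(spec.shape)
```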
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- The Role of the Input in Natural Language Video Description [60.03448250024277]
Natural Language Video Description (NLVD) has recently received strong interest in the Computer Vision, Natural Language Processing, Multimedia, and Autonomous Robotics communities.
This work presents an extensive study of the role of the visual input, evaluated with respect to the overall NLP performance.
A t-SNE-based analysis is proposed to evaluate the effects of the considered transformations on the overall visual data distribution.
arXiv Detail & Related papers (2021-02-09T19:00:35Z)
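As a hedged illustration of the t-SNE analysis this entry mentions, the sketch below projects a set of visual feature vectors to 2-D with scikit-learn. The random feature array is a stand-in for real frame descriptors.

```python
# Minimal t-SNE projection of visual features; the feature array is a
# random stand-in for real frame descriptors.
import numpy as np
from sklearn.manifold import TSNE

features = np.random.rand(500, 2048)  # e.g. CNN descriptors, one per frame
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(features)
print(coords.shape)                   # (500, 2) points ready to scatter-plot
```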