Panoramic Video Salient Object Detection with Ambisonic Audio Guidance
- URL: http://arxiv.org/abs/2211.14419v1
- Date: Sat, 26 Nov 2022 00:50:02 GMT
- Title: Panoramic Video Salient Object Detection with Ambisonic Audio Guidance
- Authors: Xiang Li, Haoyuan Cao, Shijie Zhao, Junlin Li, Li Zhang, Bhiksha Raj
- Abstract summary: We propose a multimodal fusion module equipped with two pseudo-siamese audio-visual context fusion blocks.
Spherical positional encoding in this block enables fusion in the 3D context, capturing the spatial correspondence between pixels and sound sources.
Our method achieves state-of-the-art performance on the ASOD60K dataset.
- Score: 24.341735475632884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video salient object detection (VSOD), as a fundamental computer vision
problem, has been extensively discussed in the last decade. However, all
existing works focus on addressing the VSOD problem in 2D scenarios. With the
rapid development of VR devices, panoramic videos have become a promising
alternative to 2D videos, offering an immersive experience of the real world. In
this paper, we aim to tackle the video salient object detection problem for
panoramic videos, with their corresponding ambisonic audio. A multimodal
fusion module equipped with two pseudo-siamese audio-visual context fusion
(ACF) blocks is proposed to effectively conduct audio-visual interaction. The
ACF block equipped with spherical positional encoding enables the fusion in the
3D context to capture the spatial correspondence between pixels and sound
sources from the equirectangular frames and ambisonic audio. Experimental
results verify the effectiveness of our proposed components and demonstrate
that our method achieves state-of-the-art performance on the ASOD60K dataset.
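The abstract ties equirectangular pixels to ambisonic sound-source directions via spherical positional encoding, but the exact formulation is not given here. As a minimal sketch of the underlying geometry, the code below maps an equirectangular pixel to a unit direction on the sphere, computes first-order ambisonic encoding gains for that direction (assuming B-format in the ACN/SN3D convention), and applies a hypothetical sinusoidal positional encoding over the 3D coordinates. All function names and the encoding variant are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def equirect_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a 3D unit direction.

    u in [0, width) spans longitude [-pi, pi); v in [0, height)
    spans latitude [pi/2, -pi/2] (top of the frame = north pole).
    """
    lon = (u / width) * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v / height) * np.pi
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.array([x, y, z])

def foa_encode_gains(direction):
    """First-order ambisonic gains for a plane wave from `direction`.

    Assumes B-format with ACN channel order [W, Y, Z, X] and SN3D
    normalization, so the gains are simply [1, y, z, x].
    """
    x, y, z = direction
    return np.array([1.0, y, z, x])

def spherical_positional_encoding(direction, num_freqs=4):
    """Hypothetical sinusoidal encoding of a 3D direction vector.

    Applies sin/cos at geometrically spaced frequencies to each of
    the (x, y, z) coordinates, yielding 2 * 3 * num_freqs features.
    """
    freqs = 2.0 ** np.arange(num_freqs)        # 1, 2, 4, 8, ...
    angles = np.outer(freqs, direction).ravel()
    return np.concatenate([np.sin(angles), np.cos(angles)])
```

With such a mapping, a pixel's direction and a sound source's ambisonic gains live in the same spherical frame, which is what lets an ACF-style block score pixel-to-source correspondence in 3D rather than in distorted 2D image coordinates.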
Related papers
- IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation [136.5813547244979]
We present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation.
Our IDOL consists of two novel designs. First, it enables dual-modal generation and maximizes the information exchange between video and depth generation.
Second, to ensure a precise video-depth spatial alignment, we propose a motion consistency loss that enforces consistency between the video and depth feature motion fields.
arXiv Detail & Related papers (2024-07-15T17:36:54Z) - EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving [67.82112360246025]
Auditory Referring Multi-Object Tracking (AR-MOT) is a challenging problem in autonomous driving.
Due to the lack of semantic modeling capacity in audio and video, existing works have mainly focused on text-based multi-object tracking.
We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
arXiv Detail & Related papers (2024-02-28T12:50:16Z) - Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation [26.85397648493918]
We propose an innovative audio-visual transformer framework, COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns.
For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement.
Experiments and ablation studies on AVSBench-object and AVSBench-semantic datasets demonstrate that COMBO surpasses previous state-of-the-art methods.
arXiv Detail & Related papers (2023-12-11T15:51:38Z) - CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose a decoupled audio-video dependence combining audio and video features from their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z) - Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation [36.50512269898893]
To distinguish the sounding objects from silent ones, audio-visual semantic correspondence and temporal interaction are required.
We propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information.
Our method achieves state-of-the-art performance, with gains of 7.1% M_J and 7.6% M_F on the MS3 setting.
arXiv Detail & Related papers (2023-09-18T05:58:06Z) - Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation [36.38300120482868]
We present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation.
ASMP achieves a clear improvement in source separation quality, outperforming prior works on two challenging audio-visual datasets.
arXiv Detail & Related papers (2022-10-29T02:55:39Z) - Sound-Guided Semantic Video Generation [15.225598817462478]
We propose a framework to generate realistic videos by leveraging multimodal (sound-image-text) embedding space.
As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound.
arXiv Detail & Related papers (2022-04-20T07:33:10Z) - Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z) - ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos [79.05486554647918]
We propose PV-SOD, a new task that aims to segment salient objects from panoramic videos.
In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD).
We collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy.
arXiv Detail & Related papers (2021-07-24T15:14:20Z) - Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z) - Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.