Panoramic Video Salient Object Detection with Ambisonic Audio Guidance
- URL: http://arxiv.org/abs/2211.14419v1
- Date: Sat, 26 Nov 2022 00:50:02 GMT
- Title: Panoramic Video Salient Object Detection with Ambisonic Audio Guidance
- Authors: Xiang Li, Haoyuan Cao, Shijie Zhao, Junlin Li, Li Zhang, Bhiksha Raj
- Abstract summary: We propose a multimodal fusion module equipped with two pseudo-siamese audio-visual context fusion blocks.
Spherical positional encoding in this block enables fusion in the 3D context, capturing the spatial correspondence between pixels and sound sources.
Our method achieves state-of-the-art performance on the ASOD60K dataset.
- Score: 24.341735475632884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video salient object detection (VSOD), as a fundamental computer vision
problem, has been extensively discussed in the last decade. However, all
existing works focus on addressing the VSOD problem in 2D scenarios. With the
rapid development of VR devices, panoramic videos have become a promising
alternative to 2D videos, offering an immersive experience of the real world. In
this paper, we aim to tackle the video salient object detection problem for
panoramic videos, with their corresponding ambisonic audio. A multimodal
fusion module equipped with two pseudo-siamese audio-visual context fusion
(ACF) blocks is proposed to effectively conduct audio-visual interaction. The
ACF block equipped with spherical positional encoding enables the fusion in the
3D context to capture the spatial correspondence between pixels and sound
sources from the equirectangular frames and ambisonic audio. Experimental
results verify the effectiveness of our proposed components and demonstrate
that our method achieves state-of-the-art performance on the ASOD60K dataset.
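The abstract ties equirectangular pixels to ambisonic sound-source directions via spherical positional encoding, but the exact formulation is not given here. As a minimal sketch of the underlying geometry, the code below maps an equirectangular pixel to a unit direction on the sphere, computes first-order ambisonic encoding gains for that direction (assuming B-format in the ACN/SN3D convention), and applies a hypothetical sinusoidal positional encoding over the 3D coordinates. All function names and the encoding variant are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def equirect_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a 3D unit direction.

    u in [0, width) spans longitude [-pi, pi); v in [0, height)
    spans latitude [pi/2, -pi/2] (top of the frame = north pole).
    """
    lon = (u / width) * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v / height) * np.pi
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.array([x, y, z])

def foa_encode_gains(direction):
    """First-order ambisonic gains for a plane wave from `direction`.

    Assumes B-format with ACN channel order [W, Y, Z, X] and SN3D
    normalization, so the gains are simply [1, y, z, x].
    """
    x, y, z = direction
    return np.array([1.0, y, z, x])

def spherical_positional_encoding(direction, num_freqs=4):
    """Hypothetical sinusoidal encoding of a 3D direction vector.

    Applies sin/cos at geometrically spaced frequencies to each of
    the (x, y, z) coordinates, yielding 2 * 3 * num_freqs features.
    """
    freqs = 2.0 ** np.arange(num_freqs)        # 1, 2, 4, 8, ...
    angles = np.outer(freqs, direction).ravel()
    return np.concatenate([np.sin(angles), np.cos(angles)])
```

With such a mapping, a pixel's direction and a sound source's ambisonic gains live in the same spherical frame, which is what lets an ACF-style block score pixel-to-source correspondence in 3D rather than in distorted 2D image coordinates.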
Related papers
- IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation [136.5813547244979]
We present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation.
Our IDOL consists of two novel designs. First, it enables dual-modal generation and maximizes the information exchange between video and depth generation.
Second, to ensure a precise video-depth spatial alignment, we propose a motion consistency loss that enforces consistency between the video and depth feature motion fields.
arXiv Detail & Related papers (2024-07-15T17:36:54Z) - EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving [67.82112360246025]
Auditory Referring Multi-Object Tracking (AR-MOT) is a challenging problem in autonomous driving.
Due to the lack of semantic modeling capacity in audio and video, existing works have mainly focused on text-based multi-object tracking.
We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
arXiv Detail & Related papers (2024-02-28T12:50:16Z) - Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation [26.85397648493918]
We propose an innovative audio-visual transformer framework, COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns.
For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement.
Experiments and ablation studies on AVSBench-object and AVSBench-semantic datasets demonstrate that COMBO surpasses previous state-of-the-art methods.
arXiv Detail & Related papers (2023-12-11T15:51:38Z) - CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose a decoupled audio-video dependence combining audio and video features from their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z) - Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation [36.50512269898893]
To distinguish the sounding objects from silent ones, audio-visual semantic correspondence and temporal interaction are required.
We propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information.
Our method achieves state-of-the-art performance, with gains of 7.1% M_J and 7.6% M_F on the MS3 setting.
arXiv Detail & Related papers (2023-09-18T05:58:06Z) - Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation [36.38300120482868]
We present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation.
ASMP achieves a clear improvement in source separation quality, outperforming prior works on two challenging audio-visual datasets.
arXiv Detail & Related papers (2022-10-29T02:55:39Z) - Sound-Guided Semantic Video Generation [15.225598817462478]
We propose a framework to generate realistic videos by leveraging multimodal (sound-image-text) embedding space.
As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound.
arXiv Detail & Related papers (2022-04-20T07:33:10Z) - Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z) - ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos [79.05486554647918]
We propose PV-SOD, a new task that aims to segment salient objects from panoramic videos.
In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD).
We collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy.
arXiv Detail & Related papers (2021-07-24T15:14:20Z) - Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z) - Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.