Improving Audio-Visual Segmentation with Bidirectional Generation
- URL: http://arxiv.org/abs/2308.08288v2
- Date: Tue, 19 Dec 2023 07:50:23 GMT
- Title: Improving Audio-Visual Segmentation with Bidirectional Generation
- Authors: Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, Yiran Zhong
- Abstract summary: We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
- Score: 40.78395709407226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The aim of audio-visual segmentation (AVS) is to precisely differentiate
audible objects within videos down to the pixel level. Traditional approaches
often tackle this challenge by combining information from various modalities,
where the contribution of each modality is implicitly or explicitly modeled.
Nevertheless, the interconnections between different modalities tend to be
overlooked in audio-visual modeling. In this paper, inspired by the human
ability to mentally simulate the sound of an object and its visual appearance,
we introduce a bidirectional generation framework. This framework establishes
robust correlations between an object's visual characteristics and its
associated sound, thereby enhancing the performance of AVS. To achieve this, we
employ a visual-to-audio projection component that reconstructs audio features
from object segmentation masks and minimizes reconstruction errors. Moreover,
recognizing that many sounds are linked to object movements, we introduce an
implicit volumetric motion estimation module to handle temporal dynamics that
may be challenging to capture using conventional optical flow methods. To
showcase the effectiveness of our approach, we conduct comprehensive
experiments and analyses on the widely recognized AVSBench benchmark. As a
result, we establish a new state-of-the-art performance level in the AVS
benchmark, particularly excelling in the challenging MS3 subset which involves
segmenting multiple sound sources. To facilitate reproducibility, we plan to
release both the source code and the pre-trained model.
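The visual-to-audio projection described in the abstract can be illustrated with a toy example. The abstract does not specify the architecture, so everything below is an assumption made for illustration: masked average pooling over visual features, a single linear map into the audio feature space, and a mean-squared reconstruction error. The helper names (`masked_average_pool`, `project`, `reconstruction_loss`) are hypothetical and are not taken from the authors' code.

```python
from typing import List

Vector = List[float]


def masked_average_pool(features: List[List[Vector]], mask: List[List[float]]) -> Vector:
    """Average the per-pixel feature vectors, weighted by a soft mask in [0, 1].

    features: H x W x C nested list (hypothetical visual feature map).
    mask:     H x W segmentation mask of the sounding object.
    """
    channels = len(features[0][0])
    pooled = [0.0] * channels
    total = 0.0
    for feat_row, mask_row in zip(features, mask):
        for feat, m in zip(feat_row, mask_row):
            total += m
            for c in range(channels):
                pooled[c] += m * feat[c]
    if total == 0.0:
        return pooled  # empty mask: nothing to pool, return zeros
    return [p / total for p in pooled]


def project(vec: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Linear map into the audio feature space (assumed form: W @ vec + b)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def reconstruction_loss(pred: Vector, target: Vector) -> float:
    """Mean-squared error between reconstructed and reference audio features."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Toy 2 x 2 feature map with 2 channels and a mask selecting two pixels.
features = [[[1.0, 0.0], [3.0, 2.0]],
            [[5.0, 4.0], [7.0, 6.0]]]
mask = [[1.0, 0.0],
        [0.0, 1.0]]
weight = [[1.0, 0.0],
          [0.0, 1.0]]  # identity projection, for readability only
bias = [0.0, 0.0]
audio_target = [4.0, 1.0]  # made-up audio feature vector

pooled = masked_average_pool(features, mask)
audio_pred = project(pooled, weight, bias)
loss = reconstruction_loss(audio_pred, audio_target)
```

In a trainable setting, the gradient of such a loss would flow back into the predicted mask, so a mask that covers the wrong object yields audio features that reconstruct poorly. That is one plausible reading of how reconstruction could strengthen the audio-visual correlation; the paper itself should be consulted for the actual design.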
Related papers
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling [21.380988939240844]
We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio.
We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences.
arXiv Detail & Related papers (2023-12-08T23:55:19Z)
- Multimodal Variational Auto-encoder based Audio-Visual Segmentation [46.67599800471001]
ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation.
Our approach leads to a new state-of-the-art for audio-visual segmentation, with a 3.84 mIOU performance leap.
arXiv Detail & Related papers (2023-10-12T13:09:40Z)
- QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition [47.103732403296654]
Multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces.
We introduce a global-to-local quantization mechanism, which distills knowledge from stable global (clip-level) features into local (frame-level) ones.
Experiments demonstrate that our semantically decomposed audio representation significantly improves AVS performance.
arXiv Detail & Related papers (2023-09-29T20:48:44Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.