Multimodal Variational Auto-encoder based Audio-Visual Segmentation
- URL: http://arxiv.org/abs/2310.08303v1
- Date: Thu, 12 Oct 2023 13:09:40 GMT
- Title: Multimodal Variational Auto-encoder based Audio-Visual Segmentation
- Authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, Yuchao Dai
- Abstract summary: ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation.
Our approach leads to a new state-of-the-art for audio-visual segmentation, with a 3.84 mIOU performance leap.
- Score: 46.67599800471001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources
in the video sequence. Existing AVS methods focus on implicit feature fusion
strategies, where models are trained to fit the discrete samples in the
dataset. With a limited and less diverse dataset, the resulting performance is
usually unsatisfactory. In contrast, we address this problem from an effective
representation learning perspective, aiming to model the contribution of each
modality explicitly. Specifically, we find that audio contains critical
category information of the sound producers, and visual data provides candidate
sound producer(s). Their shared information corresponds to the target sound
producer(s) shown in the visual data. In this case, cross-modal shared
representation learning is especially important for AVS. To achieve this, our
ECMVAE factorizes the representations of each modality with a modality-shared
representation and a modality-specific representation. An orthogonality
constraint is applied between the shared and specific representations to
maintain the exclusive attribute of the factorized latent code. Further, a
mutual information maximization regularizer is introduced to achieve extensive
exploration of each modality. Quantitative and qualitative evaluations on the
AVSBench demonstrate the effectiveness of our approach, leading to a new
state-of-the-art for AVS, with a 3.84 mIOU performance leap on the challenging
MS3 subset for multiple sound source segmentation.
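To make the factorization described in the abstract concrete, the sketch below splits each modality's features into a modality-shared and a modality-specific code, penalizes overlap between the two codes with a cosine-based orthogonality term, and uses an InfoNCE objective as a common surrogate for the mutual-information regularizer. This is a minimal illustrative sketch, not the authors' ECMVAE implementation: the real model is a conditional variational auto-encoder with learned latent distributions, whereas the encoder shapes, loss forms, and dimensions here are assumptions chosen for brevity.
```python
# Illustrative sketch (not the authors' code) of shared/specific latent factorization
# with an orthogonality penalty and an InfoNCE surrogate for MI maximization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedEncoder(nn.Module):
    """Maps one modality's pooled features to shared and specific latent codes."""
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.shared_head = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU(),
                                         nn.Linear(latent_dim, latent_dim))
        self.specific_head = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU(),
                                           nn.Linear(latent_dim, latent_dim))

    def forward(self, x):
        return self.shared_head(x), self.specific_head(x)

def orthogonality_loss(shared, specific):
    """Penalize overlap between shared and specific codes (squared cosine similarity)."""
    return F.cosine_similarity(shared, specific, dim=-1).pow(2).mean()

def infonce_shared_alignment(a_shared, v_shared, temperature: float = 0.07):
    """InfoNCE lower bound pushing the audio and visual shared codes of the same
    clip to carry the same information (a standard MI-maximization surrogate)."""
    a = F.normalize(a_shared, dim=-1)
    v = F.normalize(v_shared, dim=-1)
    logits = a @ v.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for backbone outputs.
audio_feat = torch.randn(8, 256)   # hypothetical pooled audio-backbone features
visual_feat = torch.randn(8, 512)  # hypothetical pooled visual-backbone features
audio_enc, visual_enc = FactorizedEncoder(256), FactorizedEncoder(512)

a_sh, a_sp = audio_enc(audio_feat)
v_sh, v_sp = visual_enc(visual_feat)

loss = (orthogonality_loss(a_sh, a_sp) + orthogonality_loss(v_sh, v_sp)
        + infonce_shared_alignment(a_sh, v_sh))
loss.backward()
```
In the paper's formulation the shared code is expected to capture the target sound producer(s) visible in the frame, while the specific codes retain modality-private details; the orthogonality term keeps those roles from collapsing into one another.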
Related papers
- Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation [17.123212921673176]
We propose a Spatio-Temporal, Bi-Visual Attention (ST-B) module integrated into the middle of SAM's encoder and mask decoder.
It adaptively updates the audio-visual features to convey the temporal correspondence between the video frames and audio streams.
Our proposed model outperforms the state-of-the-art methods on AVS benchmarks, especially with an 8.3% mIoU gain on a challenging multi-sources subset.
arXiv Detail & Related papers (2024-06-10T10:53:23Z) - Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z) - AV-data2vec: Self-supervised Learning of Audio-Visual Speech
Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z) - Leveraging Modality-specific Representations for Audio-visual Speech
Recognition via Reinforcement Learning [25.743503223389784]
We propose a reinforcement learning (RL) based framework called MSRL.
We customize a reward function directly related to task-specific metrics.
Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions.
arXiv Detail & Related papers (2022-12-10T14:01:54Z) - Learnable Irrelevant Modality Dropout for Multimodal Action Recognition
on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.