Weakly-Supervised Audio-Visual Segmentation
- URL: http://arxiv.org/abs/2311.15080v1
- Date: Sat, 25 Nov 2023 17:18:35 GMT
- Title: Weakly-Supervised Audio-Visual Segmentation
- Authors: Shentong Mo, Bhiksha Raj
- Abstract summary: We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning for audio-visual segmentation.
Experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.
- Score: 44.632423828359315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual segmentation is a challenging task that aims to predict
pixel-level masks for sound sources in a video. Previous work applied
comprehensive, manually designed architectures supervised with large numbers of
pixel-wise accurate masks. However, such pixel-level masks are expensive to
annotate and not available in all cases. In this work, we aim to simplify the
supervision to instance-level annotations, i.e., weakly-supervised audio-visual
segmentation.
We present a novel Weakly-Supervised Audio-Visual Segmentation framework,
namely WS-AVS, that can learn multi-scale audio-visual alignment with
multi-scale multiple-instance contrastive learning for audio-visual
segmentation. Extensive experiments on AVSBench demonstrate the effectiveness
of our WS-AVS in the weakly-supervised audio-visual segmentation of
single-source and multi-source scenarios.
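The abstract describes the approach only at a high level, so the following is a minimal PyTorch sketch of what a multi-scale multiple-instance contrastive objective of this kind can look like. It is not the WS-AVS implementation: the function name mi_nce_loss, the feature shapes, and the temperature value are illustrative assumptions.

```python
# Hypothetical sketch of a multi-scale multiple-instance contrastive loss;
# not the official WS-AVS code. Shapes and names are assumptions.
import torch
import torch.nn.functional as F

def mi_nce_loss(audio, vis_maps, temperature=0.07):
    """audio: (B, D) clip-level audio embeddings.
    vis_maps: list of (B, D, H, W) visual feature maps at several scales."""
    total = 0.0
    for vis in vis_maps:
        B, D, H, W = vis.shape
        a = F.normalize(audio, dim=1)              # (B, D)
        v = F.normalize(vis.flatten(2), dim=1)     # (B, D, H*W)
        # Similarity of every audio clip to every spatial location of
        # every frame in the batch: (B_audio, B_image, H*W).
        sim = torch.einsum("ad,bdk->abk", a, v)
        # Multiple-instance step: represent each (audio, frame) pair by
        # its best-matching spatial location.
        logits = sim.max(dim=2).values / temperature   # (B, B)
        targets = torch.arange(B, device=logits.device)
        # Symmetric audio-to-visual and visual-to-audio InfoNCE.
        total = total + 0.5 * (F.cross_entropy(logits, targets)
                               + F.cross_entropy(logits.t(), targets))
    return total / len(vis_maps)

# Usage with random stand-ins for encoder outputs at two scales:
audio = torch.randn(4, 256)
vis_maps = [torch.randn(4, 256, 28, 28), torch.randn(4, 256, 14, 14)]
loss = mi_nce_loss(audio, vis_maps)
```

The max over spatial locations is what makes instance-level labels sufficient: only the best-matching region of the paired frame is pulled toward the audio embedding, so no pixel-wise mask is needed during training.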
Related papers
- Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation [17.123212921673176]
We propose a Spatio-Temporal, Bi-Visual Attention (ST-B) module integrated between SAM's encoder and mask decoder.
It adaptively updates the audio-visual features to convey the temporal correspondence between the video frames and audio streams.
Our proposed model outperforms state-of-the-art methods on AVS benchmarks, especially with an 8.3% mIoU gain on a challenging multi-source subset.
arXiv Detail & Related papers (2024-06-10T10:53:23Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation, requiring neither task-specific data annotations nor model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- AVSegFormer: Audio-Visual Segmentation with Transformer [42.24135756439358]
A new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video.
This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges.
We propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
arXiv Detail & Related papers (2023-07-03T16:37:10Z)
- AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation [30.756247389435803]
Segment Anything Model (SAM) has recently shown its powerful effectiveness in visual segmentation tasks.
We propose AV-SAM, a framework built on SAM that can generate sounding object masks corresponding to the audio.
We conduct extensive experiments on Flickr-SoundNet and AVSBench datasets.
arXiv Detail & Related papers (2023-05-03T00:33:52Z)
- Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
- Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos; the mIoU metric reported on this benchmark is sketched after this list.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)
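Several entries above, including the AVSBench paper, report mIoU between predicted and ground-truth masks. As a reference point, here is a minimal sketch of how per-mask IoU is commonly computed and averaged; it is not the official AVSBench evaluation script, and the binarization threshold is an illustrative assumption.

```python
# Hypothetical mIoU sketch for AVS-style mask evaluation; not the official
# AVSBench script. The 0.5 threshold is an assumption.
import torch

def mean_iou(pred_masks, gt_masks, threshold=0.5, eps=1e-6):
    """pred_masks: (N, H, W) predicted scores in [0, 1].
    gt_masks: (N, H, W) binary masks of the sounding objects."""
    pred = (pred_masks > threshold).float()
    gt = gt_masks.float()
    inter = (pred * gt).sum(dim=(1, 2))
    union = ((pred + gt) > 0).float().sum(dim=(1, 2))
    return ((inter + eps) / (union + eps)).mean().item()

# Usage with random stand-ins:
pred = torch.rand(8, 224, 224)
gt = (torch.rand(8, 224, 224) > 0.7).float()
print(mean_iou(pred, gt))
```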