AV-SAM: Segment Anything Model Meets Audio-Visual Localization and
Segmentation
- URL: http://arxiv.org/abs/2305.01836v1
- Date: Wed, 3 May 2023 00:33:52 GMT
- Title: AV-SAM: Segment Anything Model Meets Audio-Visual Localization and
Segmentation
- Authors: Shentong Mo, Yapeng Tian
- Abstract summary: Segment Anything Model (SAM) has recently demonstrated strong performance in visual segmentation tasks.
We propose AV-SAM, a framework built on SAM that can generate sounding object masks corresponding to the audio.
We conduct extensive experiments on Flickr-SoundNet and AVSBench datasets.
- Score: 30.756247389435803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Segment Anything Model (SAM) has recently demonstrated strong performance
in visual segmentation tasks. However, there has been little exploration of how
SAM performs on audio-visual tasks, such as visual sound localization and
segmentation. In this work, we propose a simple yet effective audio-visual
localization and segmentation framework based on the Segment Anything Model,
namely AV-SAM, that can generate sounding object masks corresponding to the
audio. Specifically, our AV-SAM simply leverages pixel-wise audio-visual fusion
across audio features and visual features from the pre-trained image encoder in
SAM to aggregate cross-modal representations. Then, the aggregated cross-modal
features are fed into the prompt encoder and mask decoder to generate the final
audio-visual segmentation masks. We conduct extensive experiments on
Flickr-SoundNet and AVSBench datasets. The results demonstrate that the
proposed AV-SAM can achieve competitive performance on sounding object
localization and segmentation.
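As a concrete illustration of the fusion step described in the abstract, the following is a minimal sketch of pixel-wise audio-visual fusion feeding a mask head. The tensor shapes, the gating rule, and the toy decoder head are assumptions made for this sketch only; the actual AV-SAM builds on SAM's pre-trained image encoder, prompt encoder, and mask decoder.

```python
# Minimal sketch of AV-SAM-style pixel-wise audio-visual fusion (illustration only).
# Shapes, module names, and the fusion rule are assumptions, not the authors' code.
import torch
import torch.nn as nn

class PixelWiseAVFusion(nn.Module):
    """Project a clip-level audio embedding and gate every visual feature pixel with it."""
    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, visual_dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) from a frozen image encoder (e.g. SAM's ViT backbone)
        # audio:  (B, D) clip-level audio embedding from a pre-trained audio encoder
        a = self.audio_proj(audio)            # (B, C)
        a = a.unsqueeze(-1).unsqueeze(-1)     # (B, C, 1, 1)
        return visual * torch.sigmoid(a)      # per-pixel cross-modal gating

# Stand-in for SAM's prompt encoder / mask decoder (the real ones come from the
# `segment_anything` package; this toy head just keeps the sketch runnable).
fusion = PixelWiseAVFusion(audio_dim=128, visual_dim=256)
mask_decoder = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1),          # 1-channel sounding-object mask logits
)

visual_feats = torch.randn(2, 256, 64, 64)    # fake image-encoder output
audio_feats = torch.randn(2, 128)             # fake audio embedding
fused = fusion(visual_feats, audio_feats)
mask_logits = mask_decoder(fused)             # (2, 1, 64, 64); upsample to image size
```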
Related papers
- Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation [17.123212921673176]
We propose a Spatio-Temporal, Bi-Visual Attention (ST-B) module integrated between SAM's image encoder and mask decoder.
It adaptively updates the audio-visual features to convey the temporal correspondence between the video frames and audio streams.
Our proposed model outperforms state-of-the-art methods on AVS benchmarks, especially with an 8.3% mIoU gain on a challenging multi-source subset.
arXiv Detail & Related papers (2024-06-10T10:53:23Z) - MAS-SAM: Segment Any Marine Animal with Aggregated Features [55.91291540810978]
We propose a novel feature learning framework named MAS-SAM for marine animal segmentation.
Our method extracts richer marine information, ranging from global contextual cues to fine-grained local details.
arXiv Detail & Related papers (2024-04-24T07:38:14Z) - Weakly-Supervised Audio-Visual Segmentation [44.632423828359315]
We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-instance contrastive learning for audio-visual segmentation.
Experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.
arXiv Detail & Related papers (2023-11-25T17:18:35Z) - Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z) - Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics [26.473529162341837]
We present an audio-visual instance-aware segmentation approach to overcome the dataset bias.
Our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio.
Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
arXiv Detail & Related papers (2023-07-31T12:56:30Z) - Annotation-free Audio-Visual Segmentation [46.42570058385209]
We propose a novel pipeline for generating artificial data for the Audio-Visual Segmentation task without extra manual annotations.
We leverage existing image segmentation and audio datasets and match the image-mask pairs with their corresponding audio samples using category labels.
We also introduce a lightweight model, SAMA-AVS, which adapts the pre-trained Segment Anything Model (SAM) to the AVS task.
arXiv Detail & Related papers (2023-05-18T14:52:45Z) - Transavs: End-To-End Audio-Visual Segmentation With Transformer [33.56539999875508]
We propose TransAVS, the first Transformer-based end-to-end framework for the Audio-Visual Segmentation task.
TransAVS disentangles the audio stream into audio queries, which interact with images and are decoded into segmentation masks.
Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset.
arXiv Detail & Related papers (2023-05-12T03:31:04Z) - Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
The paper proposes the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning with masked data modeling to learn joint audio-visual representations.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z) - Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos; a minimal sketch of the mean-IoU style metric used to score such masks follows this list.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)
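Results in the list above are reported with pixel-level mask metrics such as mean IoU (e.g., the 8.3% mIoU gain cited earlier and the AVSBench evaluation). Below is a minimal, assumption-laden sketch of a mean-IoU metric over binary sounding-object masks; thresholding and averaging details may differ from the official benchmark evaluation code.

```python
# Minimal sketch of a mean-IoU metric over binary sounding-object masks
# (my formulation; not the official AVSBench evaluation script).
import numpy as np

def binary_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """IoU between a predicted and a ground-truth binary mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float((inter + eps) / (union + eps))

def mean_iou(preds, gts) -> float:
    """Average IoU over a set of frames (one predicted/annotated mask pair per frame)."""
    return float(np.mean([binary_iou(p, g) for p, g in zip(preds, gts)]))

# Example: two 4x4 frames with predicted vs. annotated sounding-object pixels.
pred = [np.eye(4, dtype=int), np.ones((4, 4), dtype=int)]
gt = [np.eye(4, dtype=int), np.zeros((4, 4), dtype=int)]
gt[1][:2, :2] = 1
print(mean_iou(pred, gt))  # IoU = 1.0 for frame 1, 0.25 for frame 2 -> 0.625
```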