Annotation-free Audio-Visual Segmentation
- URL: http://arxiv.org/abs/2305.11019v4
- Date: Sat, 7 Oct 2023 07:57:15 GMT
- Title: Annotation-free Audio-Visual Segmentation
- Authors: Jinxiang Liu, Yu Wang, Chen Ju, Chaofan Ma, Ya Zhang, Weidi Xie
- Abstract summary: We propose a novel pipeline for generating artificial data for the Audio-Visual Segmentation (AVS) task without extra manual annotations.
We leverage existing image segmentation and audio datasets and match image-mask pairs with their corresponding audio samples using category labels.
We also introduce a lightweight model, SAMA-AVS, which adapts the pre-trained Segment Anything Model (SAM) to the AVS task.
- Score: 46.42570058385209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of Audio-Visual Segmentation (AVS) is to localise the sounding
objects within visual scenes by accurately predicting pixel-wise segmentation
masks. Tackling the task requires careful consideration of both the data and the model. In this paper, we first introduce a novel pipeline for generating artificial data for the AVS task without extra manual annotations. We leverage existing image segmentation and audio datasets and match image-mask pairs with their corresponding audio samples using the category labels of the segmentation datasets, which allows us to effortlessly compose (image, audio, mask) triplets for training AVS models. The pipeline is annotation-free and scales to a large number of categories. Additionally, we introduce a lightweight model, SAMA-AVS, which adapts the pre-trained Segment Anything Model (SAM) to the AVS task. By introducing only a small number of trainable parameters through adapters, the proposed model achieves effective audio-visual fusion and interaction in the encoding stage while keeping the vast majority of parameters fixed. We conduct extensive experiments, and the results show that our proposed model substantially surpasses competing methods. Moreover, pretraining the proposed model on our synthetic data further improves performance on real AVSBench data, reaching 83.17 mIoU on the S4 subset and 66.95 mIoU on the MS3 subset. The project page is
https://jinxiang-liu.github.io/anno-free-AVS/.
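The annotation-free data pipeline described in the abstract amounts to a label-matching join between an existing segmentation dataset and an existing audio dataset. Below is a minimal Python sketch of that idea under assumed dataset layouts; the record types and the compose_triplets helper are illustrative names, not the authors' released code.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative record types: the real pipeline draws image-mask pairs from
# existing segmentation datasets and clips from existing audio datasets,
# both of which already carry category labels.
@dataclass
class SegSample:
    image_path: str
    mask_path: str
    category: str

@dataclass
class AudioSample:
    audio_path: str
    category: str

def compose_triplets(seg_samples: List[SegSample],
                     audio_samples: List[AudioSample],
                     clips_per_pair: int = 1,
                     seed: int = 0) -> List[Tuple[str, str, str]]:
    """Match every image-mask pair with audio clips sharing its category
    label, yielding (image, audio, mask) triplets without manual annotation."""
    rng = random.Random(seed)
    audio_by_category = defaultdict(list)
    for clip in audio_samples:
        audio_by_category[clip.category].append(clip)

    triplets = []
    for seg in seg_samples:
        candidates = audio_by_category.get(seg.category, [])
        if not candidates:
            continue  # no audio for this category; the pair is skipped
        k = min(clips_per_pair, len(candidates))
        for clip in rng.sample(candidates, k):
            triplets.append((seg.image_path, clip.audio_path, seg.mask_path))
    return triplets
```

Only categories present in both datasets produce triplets, which is why the pipeline scales with the category coverage of the source datasets rather than with manual labelling effort.

The adapter idea can be sketched in the same spirit: a small bottleneck module injects an audio embedding into the features of the frozen SAM image encoder, so only the adapter parameters are trained. Dimensions and module structure here are illustrative assumptions, not the actual SAMA-AVS architecture.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Bottleneck adapter that fuses an audio embedding into visual tokens.
    A minimal illustration of adapter-based audio-visual fusion; dimensions
    are placeholders, not the SAMA-AVS configuration."""
    def __init__(self, visual_dim: int = 768, audio_dim: int = 128,
                 bottleneck_dim: int = 64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, visual_dim)
        self.down = nn.Linear(visual_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, visual_dim)
        self.act = nn.GELU()

    def forward(self, visual_tokens: torch.Tensor,
                audio_emb: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, visual_dim) from a frozen SAM encoder block
        # audio_emb:     (B, audio_dim) from an audio backbone
        fused = visual_tokens + self.audio_proj(audio_emb).unsqueeze(1)
        # Residual bottleneck update; only these few layers are trainable.
        return visual_tokens + self.up(self.act(self.down(fused)))
```

In such a setup the SAM encoder weights would stay frozen (e.g. requires_grad set to False), leaving the trainable parameter count to the adapters and whatever lightweight prediction head the task requires.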
Related papers
- SAVE: Segment Audio-Visual Easy way using Segment Anything Model [0.0]
This study presents a lightweight approach, SAVE, which efficiently adapts the pre-trained Segment Anything Model (SAM) to the AVS task.
Our proposed model achieves effective audio-visual fusion and interaction during the encoding stage.
arXiv Detail & Related papers (2024-07-02T07:22:28Z)
- Unsupervised Audio-Visual Segmentation with Modality Alignment [42.613786372067814]
Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound.
Current AVS methods rely on costly fine-grained annotations of mask-audio pairs, making them impractical for scalability.
We propose an unsupervised learning method, named Modality Correspondence Alignment (MoCA), which seamlessly integrates off-the-shelf foundation models.
arXiv Detail & Related papers (2024-03-21T07:56:09Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in supervised learning fashion.
We introduce unsupervised audio-visual segmentation, requiring neither task-specific data annotations nor model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation [30.756247389435803]
Segment Anything Model (SAM) has recently shown its powerful effectiveness in visual segmentation tasks.
We propose AV-SAM, a framework based on SAM that can generate sounding object masks corresponding to the audio.
We conduct extensive experiments on Flickr-SoundNet and AVSBench datasets.
arXiv Detail & Related papers (2023-05-03T00:33:52Z) - Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning with masked data modeling to learn a joint audio-visual representation.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and that self-supervised models trained on our automatically constructed data achieve downstream performance comparable to models trained on existing video datasets of similar scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.