Leveraging Foundation models for Unsupervised Audio-Visual Segmentation
- URL: http://arxiv.org/abs/2309.06728v1
- Date: Wed, 13 Sep 2023 05:05:47 GMT
- Title: Leveraging Foundation models for Unsupervised Audio-Visual Segmentation
- Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Xiatian Zhu
- Abstract summary: Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation, with no need for task-specific data annotations or model training.
- Score: 49.94366155560371
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in
a visual scene at the pixel level. Existing AVS methods require fine-grained
annotations of audio-mask pairs in a supervised learning fashion. This limits
their scalability, since acquiring such cross-modality pixel-level labels is
time-consuming and tedious. To overcome this obstacle, in this work we
introduce unsupervised audio-visual segmentation, with no need for task-specific
data annotations or model training. To tackle this newly proposed problem,
we formulate a novel Cross-Modality Semantic Filtering (CMSF) approach to
accurately associate the underlying audio-mask pairs by leveraging
off-the-shelf multi-modal foundation models (e.g., detection [1], open-world
segmentation [2] and multi-modal alignment [3]). Guiding the proposal
generation by either audio or visual cues, we design two training-free
variants: AT-GDINO-SAM and OWOD-BIND. Extensive experiments on the AVS-Bench
dataset show that our unsupervised approach performs well in comparison to
prior supervised counterparts across complex scenarios with multiple
auditory objects. In particular, in situations where existing supervised AVS
methods struggle with overlapping foreground objects, our models still excel at
accurately segmenting the overlapping auditory objects. Our code will be publicly
released.
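
The two variants named above compose existing foundation models at inference time rather than training anything new. As a rough illustration, the sketch below shows how an AT-GDINO-SAM-style pipeline could be wired: audio tags become text prompts, an open-set detector grounds those prompts as box proposals, and a promptable segmenter turns the boxes into masks. This is a minimal sketch under those assumptions; the helper functions, names, and thresholds are hypothetical placeholders, not the authors' released code.

```python
# Illustrative sketch (not the authors' code) of a training-free,
# audio-guided segmentation pipeline in the spirit of AT-GDINO-SAM.
# Every helper below is a hypothetical placeholder: swap in a real audio
# tagger, an open-set detector such as Grounding DINO, and a promptable
# segmenter such as SAM.

from typing import List, Tuple
import numpy as np


def tag_audio(waveform: np.ndarray) -> List[Tuple[str, float]]:
    """Placeholder audio tagger: returns (class label, confidence) pairs."""
    raise NotImplementedError("plug in an off-the-shelf audio tagging model")


def detect_with_prompts(image: np.ndarray, prompts: List[str]) -> List[Tuple[np.ndarray, float]]:
    """Placeholder open-set detector: returns a (box, score) pair per grounded prompt."""
    raise NotImplementedError("plug in an open-set detection model")


def masks_from_boxes(image: np.ndarray, boxes: List[np.ndarray]) -> List[np.ndarray]:
    """Placeholder promptable segmenter: returns one binary mask per box."""
    raise NotImplementedError("plug in a promptable segmentation model")


def unsupervised_avs(image: np.ndarray,
                     waveform: np.ndarray,
                     tag_thresh: float = 0.3,
                     det_thresh: float = 0.35) -> List[np.ndarray]:
    """Training-free AVS: audio tags guide detection, detections prompt masks."""
    # 1. Turn the audio track into text prompts (the audible object categories).
    prompts = [label for label, conf in tag_audio(waveform) if conf >= tag_thresh]
    if not prompts:
        return []  # nothing audible enough to segment

    # 2. Ground the audio-derived prompts in the frame to get box proposals.
    detections = detect_with_prompts(image, prompts)
    boxes = [box for box, score in detections if score >= det_thresh]

    # 3. Convert the surviving box proposals into pixel-level masks.
    return masks_from_boxes(image, boxes)
```

The confidence thresholds here merely stand in for the cross-modality filtering step; the paper's actual CMSF criteria, and the OWOD-BIND variant that scores open-world proposals with audio-image alignment, are more involved than this skeleton suggests.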
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event Localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
- Unsupervised Audio-Visual Segmentation with Modality Alignment [42.613786372067814]
Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound.
Current AVS methods rely on costly fine-grained annotations of mask-audio pairs, making them impractical for scalability.
We propose an unsupervised learning method, named Modality Correspondence Alignment (MoCA), which seamlessly integrates off-the-shelf foundation models.
arXiv Detail & Related papers (2024-03-21T07:56:09Z)
- Weakly-Supervised Audio-Visual Segmentation [44.632423828359315]
We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-instance contrastive learning for audio-visual segmentation.
Experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.
arXiv Detail & Related papers (2023-11-25T17:18:35Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- Annotation-free Audio-Visual Segmentation [46.42570058385209]
We propose a novel pipeline for generating artificial data for the Audio-Visual Segmentation task without extra manual annotations.
We leverage existing image segmentation and audio datasets and match the image-mask pairs with their corresponding audio samples using category labels.
We also introduce a lightweight model, SAMA-AVS, which adapts the pre-trained Segment Anything Model (SAM) to the AVS task.
arXiv Detail & Related papers (2023-05-18T14:52:45Z)
- Multi-Granularity Denoising and Bidirectional Alignment for Weakly Supervised Semantic Segmentation [75.32213865436442]
We propose an end-to-end multi-granularity denoising and bidirectional alignment (MDBA) model to alleviate the noisy label and multi-class generalization issues.
The MDBA model can reach the mIoU of 69.5% and 70.2% on validation and test sets for the PASCAL VOC 2012 dataset.
arXiv Detail & Related papers (2023-05-09T03:33:43Z)
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-Vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- DETA: Denoised Task Adaptation for Few-Shot Learning [135.96805271128645]
Test-time task adaptation in few-shot learning aims to adapt a pre-trained task-agnostic model for capturing task-specific knowledge.
With only a handful of samples available, the adverse effect of either the image noise (a.k.a. X-noise) or the label noise (a.k.a. Y-noise) from support samples can be severely amplified.
We propose DEnoised Task Adaptation (DETA), a first, unified image- and label-denoising framework that complements existing task adaptation approaches.
arXiv Detail & Related papers (2023-03-11T05:23:20Z)