SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes
- URL: http://arxiv.org/abs/2506.01558v1
- Date: Mon, 02 Jun 2025 11:36:25 GMT
- Title: SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes
- Authors: Yuji Wang, Haoran Xu, Yong Liu, Jiaze Li, Yansong Tang,
- Abstract summary: We introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token. Technically, our approach includes a multimodal fusion module aimed at improving the multimodal understanding of SAM2. We conducted extensive experiments to demonstrate that SAM2-LOVE outperforms the SOTA by 8.5% in $\mathcal{J\&F}$ on the Ref-AVS benchmark.
- Score: 30.870903750545004
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reference Audio-Visual Segmentation (Ref-AVS) aims to provide pixel-wise scene understanding in Language-aided Audio-Visual Scenes (LAVS). This task requires the model to continuously segment objects referred to by text and audio from a video. Previous dual-modality methods always fail due to the lack of a third modality, and the existing triple-modality method struggles with spatio-temporal consistency, leading to target shift across frames. In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token to prompt and align SAM2 for achieving Ref-AVS in LAVS. Technically, our approach includes a multimodal fusion module aimed at improving the multimodal understanding of SAM2, as well as token propagation and accumulation strategies designed to enhance spatio-temporal consistency without forgetting historical information. We conducted extensive experiments to demonstrate that SAM2-LOVE outperforms the SOTA by 8.5\% in $\mathcal{J\&F}$ on the Ref-AVS benchmark and to showcase the simplicity and effectiveness of its components. Our code will be available here.
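The abstract's description of the pipeline (a learnable token fused from text, audio, and visual features, then propagated and accumulated across frames to prompt SAM2) can be illustrated with a short sketch. The PyTorch code below is a minimal, assumption-laden illustration of that data flow, not the authors' implementation: the names (`MultimodalTokenFusion`, `run_video`, `decode_mask`), feature shapes, and the EMA-style accumulation are placeholders chosen for readability.

```python
# Minimal sketch (not the authors' code): a single learnable token queries fused
# text/audio/visual features, is propagated frame-to-frame, and is accumulated
# over time before being used as a prompt for a SAM2-style mask decoder.
import torch
import torch.nn as nn


class MultimodalTokenFusion(nn.Module):
    """Hypothetical fusion module: one learnable prompt token attends to all modalities."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.prompt_token = nn.Parameter(torch.zeros(1, 1, dim))  # assumed shape (1, 1, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feat, audio_feat, visual_feat, prev_token=None):
        # text_feat: (B, Nt, dim), audio_feat: (B, Na, dim), visual_feat: (B, Nv, dim)
        memory = torch.cat([text_feat, audio_feat, visual_feat], dim=1)
        # Token propagation: reuse the previous frame's token as the query when available.
        query = prev_token if prev_token is not None else self.prompt_token.expand(memory.size(0), -1, -1)
        fused, _ = self.cross_attn(query, memory, memory)
        return self.norm(query + fused)  # (B, 1, dim)


def run_video(frames, fuser, decode_mask, momentum: float = 0.9):
    """Iterate over per-frame features, propagating and accumulating the token.

    `decode_mask(visual_feat, prompt)` stands in for SAM2's prompt-conditioned decoder.
    """
    token, accumulated, masks = None, None, []
    for text_feat, audio_feat, visual_feat in frames:
        token = fuser(text_feat, audio_feat, visual_feat, prev_token=token)
        # Token accumulation: an EMA-style running summary so that historical
        # information is not forgotten when prompting the decoder.
        accumulated = token if accumulated is None else momentum * accumulated + (1 - momentum) * token
        masks.append(decode_mask(visual_feat, accumulated))
    return masks
```

In this reading, propagation carries the previous frame's token forward as the attention query, while accumulation keeps a running summary so the prompt fed to the decoder retains historical information.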
Related papers
- TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models [123.17643568298116]
We present TAViS, a novel framework that couples the knowledge of multimodal foundation models for cross-modal alignment. Effectively combining these models poses two key challenges: the difficulty of transferring knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only a segmentation loss for supervision. Our approach achieves superior performance on single-source, multi-source, and semantic datasets, and excels in zero-shot settings.
arXiv Detail & Related papers (2025-06-13T03:19:47Z)
- AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting [23.76682709034273]
AuralSAM2 comprises the novel AuralFuser module, which attaches externally to SAM2 to integrate features from different modalities. This integration is facilitated by a feature pyramid, further refining semantic understanding and enhancing object awareness. Results on public benchmarks show that our approach achieves remarkable improvements over previous methods in the field.
arXiv Detail & Related papers (2025-06-01T13:57:42Z)
- 4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that a simple modification to the test-time inference method on stronger MLLMs can lead to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z)
- AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [62.682428307810525]
We introduce AVS-Mamba, a selective state space model to address the audio-visual segmentation task. Our framework incorporates two key components for video understanding and cross-modal learning. Our approach achieves new state-of-the-art results on the AVSBench-object and AVS-semantic datasets.
arXiv Detail & Related papers (2025-01-14T03:20:20Z)
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos [110.3379755761583]
Sa2VA is a unified model for grounded understanding of both images and videos. It supports a wide range of image and video tasks, including referring segmentation and conversation. We show that Sa2VA achieves state-of-the-art results across multiple tasks, particularly in referring video object segmentation.
arXiv Detail & Related papers (2025-01-07T18:58:54Z)
- SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation [4.166500345728911]
Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. We build upon the Segment-Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities. We introduce a novel adapter module that injects temporal information and multi-modal cues into the feature extraction process.
arXiv Detail & Related papers (2024-11-26T18:10:54Z)
- Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation [17.123212921673176]
We propose a Spatio-Temporal, Bidirectional Audio-Visual Attention (ST-BAVA) module integrated into the middle of SAM's encoder and mask decoder.
It adaptively updates the audio-visual features to convey the temporal correspondence between the video frames and audio streams.
Our proposed model outperforms the state-of-the-art methods on AVS benchmarks, especially with an 8.3% mIoU gain on a challenging multi-sources subset.
arXiv Detail & Related papers (2024-06-10T10:53:23Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal module.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
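For reference, the $\mathcal{J\&F}$ score reported in the main abstract above is the mean of region similarity $\mathcal{J}$ (Jaccard index, i.e. mask IoU) and contour accuracy $\mathcal{F}$ (a boundary F-measure). The sketch below is a simplified, unofficial reimplementation for intuition only; the pixel tolerance `tol` used for boundary matching is an assumed parameter, and the benchmark's own evaluation code should be used for reported numbers.

```python
# Simplified, unofficial J&F sketch: J = mask IoU (region similarity),
# F = boundary F-measure (contour accuracy), J&F = their mean.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)


def _boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels: the mask minus its morphological erosion."""
    return np.logical_and(mask, np.logical_not(binary_erosion(mask)))


def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Contour accuracy F with a small pixel tolerance (`tol` is an assumed parameter)."""
    pb, gb = _boundary(pred), _boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    precision = np.logical_and(pb, binary_dilation(gb, iterations=tol)).sum() / max(pb.sum(), 1)
    recall = np.logical_and(gb, binary_dilation(pb, iterations=tol)).sum() / max(gb.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))
```

Benchmark scores are typically reported as this quantity averaged over annotated frames and test videos.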