Related papers: SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation

URL: http://arxiv.org/abs/2411.17646v2
Date: Tue, 25 Mar 2025 17:17:59 GMT
Title: SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
Authors: Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta
Abstract summary: Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. We build upon the Segment-Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities. We introduce a novel adapter module that injects temporal information and multi-modal cues in the feature extraction process.
Score: 4.166500345728911
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods restrict reasoning either to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. We build upon the Segment-Anything 2 (SAM2) model, that provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser, by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights, and without outsourcing modality interaction to external models. To this end, we introduce a novel adapter module that injects temporal information and multi-modal cues in the feature extraction process. We further reveal the phenomenon of tracking bias in SAM2 and propose a learnable module to adjust its tracking focus when the current frame features suggest a new object more aligned with the caption. Our proposed method, SAMWISE, achieves state-of-the-art across various benchmarks, by adding a negligible overhead of less than 5 M parameters. Code is available at https://github.com/ClaudiaCuttano/SAMWISE .

Related papers

VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing of video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z)
SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes [30.870903750545004]
We introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token. Technically, our approach includes a multimodal fusion module aimed at improving multimodal understanding of SAM2. We conducted extensive experiments to demonstrate that SAM2-LOVE outperforms the SOTA by 8.5% in $\mathcal{J}\&\mathcal{F}$ on the Ref-AVS benchmark.
arXiv Detail & Related papers (2025-06-02T11:36:25Z)
Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders. We employ a cascade of decoders across all feature levels to optimally exploit the derived features.
arXiv Detail & Related papers (2025-01-14T03:15:46Z)
A2VIS: Amodal-Aware Approach to Video Instance Segmentation [8.082593574401704]
We propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in video.
arXiv Detail & Related papers (2024-12-02T05:44:29Z)
Referring Video Object Segmentation via Language-aligned Track Selection [30.226373787454833]
Referring video object segmentation (RVOS) requires tracking and segmenting an object throughout a video according to a given natural language expression. We introduce SOLA, a novel framework that leverages SAM2 object tokens as compact video-level object representations. Experiments show that SOLA achieves state-of-the-art performance on the MeViS dataset.
arXiv Detail & Related papers (2024-12-02T05:20:35Z)
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising [37.216493829454706]
We explore the potential of applying the Segment Anything Model to track and segment objects in videos. Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame. To enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy.
arXiv Detail & Related papers (2024-03-07T03:52:59Z)
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning [102.54669633984278]
We propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization.
arXiv Detail & Related papers (2024-02-18T03:04:38Z)
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP. We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features. S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is the task of identifying a set of actions in a video. We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video. We demonstrate that our method accurately localizes action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence. We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network [27.792054915363106]
We propose a cross-modal self-attention (CMSA) module to utilize fine details of individual words and the input image or video. A gated multi-level fusion (GMLF) module selectively integrates self-attentive cross-modal features. A cross-frame self-attention (CFSA) module effectively integrates temporal information in consecutive frames.
arXiv Detail & Related papers (2021-02-09T11:27:59Z)
VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO). The main challenge in this task is to jointly model the temporal dependency of the video with the semantic meaning of the article. We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.