Related papers: MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection

MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection

URL: http://arxiv.org/abs/2505.00739v1
Date: Wed, 30 Apr 2025 02:19:31 GMT
Title: MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection
Authors: Qiushi Yang, Yuan Yao, Miaomiao Cui, Liefeng Bo,
Abstract summary: We present MoSAM, incorporating two key strategies to integrate object motion cues into the model and establish more reliable feature memory.<n>MoSAM achieves state-of-the-art results compared to other competitors.
Score: 21.22536962888316
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The recent Segment Anything Model 2 (SAM2) has demonstrated exceptional capabilities in interactive object segmentation for both images and videos. However, as a foundational model on interactive segmentation, SAM2 performs segmentation directly based on mask memory from the past six frames, leading to two significant challenges. Firstly, during inference in videos, objects may disappear since SAM2 relies solely on memory without accounting for object motion information, which limits its long-range object tracking capabilities. Secondly, its memory is constructed from fixed past frames, making it susceptible to challenges associated with object disappearance or occlusion, due to potentially inaccurate segmentation results in memory. To address these problems, we present MoSAM, incorporating two key strategies to integrate object motion cues into the model and establish more reliable feature memory. Firstly, we propose Motion-Guided Prompting (MGP), which represents the object motion in both sparse and dense manners, then injects them into SAM2 through a set of motion-guided prompts. MGP enables the model to adjust its focus towards the direction of motion, thereby enhancing the object tracking capabilities. Furthermore, acknowledging that past segmentation results may be inaccurate, we devise a Spatial-Temporal Memory Selection (ST-MS) mechanism that dynamically identifies frames likely to contain accurate segmentation in both pixel- and frame-level. By eliminating potentially inaccurate mask predictions from memory, we can leverage more reliable memory features to exploit similar regions for improving segmentation results. Extensive experiments on various benchmarks of video object segmentation and video instance segmentation demonstrate that our MoSAM achieves state-of-the-art results compared to other competitors.

Related papers

HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback [0.0]
We introduce HQ-SMem, for High Quality video segmentation and tracking using Smart Memory.<n>Our approach incorporates three key innovations: (i) leveraging SAM with High-Quality masks (SAM-HQ) alongside appearance-based candidate-selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) implementing a dynamic smart memory mechanism that selectively stores relevant key frames while discarding redundant ones; and (iii) dynamically updating the appearance model to effectively handle complex topological object variations and reduce drift throughout the video.
arXiv Detail & Related papers (2025-07-25T03:28:05Z)
SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2 [2.659882635924329]
Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks.<n>Recent methods augment SAM 2 with hand-crafted update rules to better handle distractors and object motion.<n>We propose a fundamentally different approach using reinforcement learning for optimizing memory updates in SAM 2.
arXiv Detail & Related papers (2025-07-11T12:53:19Z)
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree [79.26409013413003]
We introduce SAM2Long, an improved training-free video object segmentation strategy.<n>It considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways.<n> SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons.
arXiv Detail & Related papers (2024-10-21T17:59:19Z)
Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT) We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information. Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models. We propose FocSAM with a pipeline redesigned on two pivotal aspects. First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object. Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
arXiv Detail & Related papers (2024-05-29T02:34:13Z)
Video Object Segmentation with Dynamic Query Modulation [23.811776213359625]
We propose a query modulation method, termed QMVOS, for object and multi-object segmentation. Our method can bring significant improvements to the memory-based SVOS method and achieve competitive performance on standard SVOS benchmarks.
arXiv Detail & Related papers (2024-03-18T07:31:39Z)
Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary Constraints [9.238103649037951]
We present a framework aimed at leveraging the raw output of SAM by exploiting two novel concepts called SAM-Generated Object (SGO) and SAM-Generated Boundary (SGB) Taking into account the content characteristics of SGO, we introduce the concept of object consistency to leverage segmented regions lacking semantic information. The boundary loss capitalizes on the distinctive features of SGB by directing the model's attention to the boundary information of the object.
arXiv Detail & Related papers (2023-12-05T03:33:47Z)
Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding. We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations. Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
Betrayed by Motion: Camouflaged Object Discovery via Motion Segmentation [93.22300146395536]
We design a computational architecture that discovers camouflaged objects in videos, specifically by exploiting motion information to perform object segmentation. We collect the first large-scale Moving Camouflaged Animals (MoCA) video dataset, which consists of over 140 clips across a diverse range of animals. We demonstrate the effectiveness of the proposed model on MoCA, and achieve competitive performance on the unsupervised segmentation protocol on DAVIS2016 by only relying on motion.
arXiv Detail & Related papers (2020-11-23T18:59:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.