SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in
Videos by Prompt Denoising
- URL: http://arxiv.org/abs/2403.04194v1
- Date: Thu, 7 Mar 2024 03:52:59 GMT
- Title: SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in
Videos by Prompt Denoising
- Authors: Tao Zhou, Wenhan Luo, Qi Ye, Zhiguo Shi, Jiming Chen
- Abstract summary: We explore the potential of applying the Segment Anything Model to track and segment objects in videos.
Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame.
To enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy.
- Score: 37.216493829454706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, promptable segmentation models, such as the Segment Anything Model
(SAM), have demonstrated robust zero-shot generalization capabilities on static
images. These promptable models exhibit denoising abilities for imprecise
prompt inputs, such as imprecise bounding boxes. In this paper, we explore the
potential of applying SAM to track and segment objects in videos where we
recognize the tracking task as a prompt denoising task. Specifically, we
iteratively propagate the bounding box of each object's mask in the preceding
frame as the prompt for the next frame. Furthermore, to enhance SAM's denoising
capability against position and size variations, we propose a multi-prompt
strategy where we provide multiple jittered and scaled box prompts for each
object and preserve the mask prediction with the highest semantic similarity to
the template mask. We also introduce a point-based refinement stage to handle
occlusions and reduce cumulative errors. Without involving tracking modules,
our approach demonstrates comparable performance in video object/instance
segmentation tasks on three datasets: DAVIS2017, YouTubeVOS2018, and UVO,
serving as a concise baseline and endowing SAM-based downstream applications
with tracking capabilities.
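
The tracking-as-prompt-denoising loop described in the abstract can be summarized in a short sketch. The following is a minimal illustration under stated assumptions, not the authors' implementation: it uses the official `segment_anything` package (`SamPredictor`), while the helpers `mask_to_box`, `jitter_and_scale`, and `mask_similarity` are hypothetical names introduced here. The paper selects among candidate masks by semantic similarity to the template mask; the placeholder below substitutes a simple IoU score, and the point-based refinement stage for occlusions is omitted.

```python
# Minimal sketch of box-prompt propagation with a multi-prompt strategy,
# assuming the official segment_anything API. Helper names are hypothetical.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Tight XYXY bounding box of a binary mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)

def jitter_and_scale(box: np.ndarray, n: int, shift: float = 0.1, scale: float = 0.1):
    """Generate n perturbed copies of an XYXY box (random shift and rescale)."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    boxes = [box]
    for _ in range(n - 1):
        dx, dy = np.random.uniform(-shift, shift, 2) * np.array([w, h])
        s = 1.0 + np.random.uniform(-scale, scale)
        cx, cy = (x0 + x1) / 2 + dx, (y0 + y1) / 2 + dy
        boxes.append(np.array([cx - s * w / 2, cy - s * h / 2,
                               cx + s * w / 2, cy + s * h / 2], dtype=np.float32))
    return boxes

def mask_similarity(mask: np.ndarray, template_mask: np.ndarray) -> float:
    """Placeholder for the paper's semantic-similarity score; plain IoU here."""
    inter = np.logical_and(mask, template_mask).sum()
    union = np.logical_or(mask, template_mask).sum()
    return inter / max(union, 1)

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def track(frames, init_mask, n_prompts=8):
    """Propagate a mask through a video by treating tracking as prompt denoising."""
    template_mask, prev_mask = init_mask, init_mask
    results = [init_mask]
    for frame in frames[1:]:                      # frames: uint8 RGB images
        predictor.set_image(frame)
        # Multi-prompt strategy: several jittered/scaled copies of the previous box.
        candidates = []
        for box in jitter_and_scale(mask_to_box(prev_mask), n_prompts):
            masks, scores, _ = predictor.predict(box=box, multimask_output=False)
            candidates.append(masks[0])
        # Keep the prediction most similar to the template mask.
        prev_mask = max(candidates, key=lambda m: mask_similarity(m, template_mask))
        results.append(prev_mask)
    return results
```

This sketch only covers the box propagation and multi-prompt selection steps; in the paper, the similarity criterion is semantic rather than IoU-based, and an additional point-based refinement stage handles occlusions and cumulative errors.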
Related papers
- Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes [18.244508068200236]
Crowd-SAM is a framework designed to enhance SAM's performance in crowded and occluded scenes.
We introduce an efficient prompt sampler (EPS) and a part-whole discrimination network (PWD-Net) to enhance mask selection and accuracy in crowded scenes.
Crowd-SAM rivals state-of-the-art (SOTA) fully-supervised object detection methods on several benchmarks including CrowdHuman and CityPersons.
arXiv Detail & Related papers (2024-07-16T08:00:01Z) - SAM-PM: Enhancing Video Camouflaged Object Detection using Spatio-Temporal Attention [0.0]
The Segment Anything Model (SAM) has gained notable recognition for its exceptional performance in image segmentation.
Camouflaged objects typically blend into the background, making them difficult to distinguish in still images.
We propose a new method, the SAM Propagation Module (SAM-PM), to overcome these challenges.
Our method effectively incorporates temporal consistency and domain-specific expertise into the segmentation network while adding less than 1% of SAM's parameters.
arXiv Detail & Related papers (2024-06-09T14:33:38Z) - FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
arXiv Detail & Related papers (2024-05-29T02:34:13Z) - Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation, which requires no task-specific data annotations or model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z) - Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model [5.632631449489529]
Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities.
We propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker.
STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features.
arXiv Detail & Related papers (2023-05-22T03:03:29Z) - Personalize Segment Anything Model with One Shot [52.54453744941516]
We propose PerSAM, a training-free personalization approach for the Segment Anything Model (SAM).
Given only a single image with a reference mask, PerSAM first localizes the target concept by a location prior.
PerSAM segments it within other images or videos via three techniques: target-guided attention, target-semantic prompting, and cascaded post-refinement.
arXiv Detail & Related papers (2023-05-04T17:59:36Z)