SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising
- URL: http://arxiv.org/abs/2403.04194v1
- Date: Thu, 7 Mar 2024 03:52:59 GMT
- Title: SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising
- Authors: Tao Zhou, Wenhan Luo, Qi Ye, Zhiguo Shi, Jiming Chen
- Abstract summary: We explore the potential of applying the Segment Anything Model to track and segment objects in videos.
Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame.
To enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy.
- Score: 37.216493829454706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, promptable segmentation models, such as the Segment Anything Model
(SAM), have demonstrated robust zero-shot generalization capabilities on static
images. These promptable models exhibit denoising abilities for imprecise
prompt inputs, such as imprecise bounding boxes. In this paper, we explore the
potential of applying SAM to track and segment objects in videos where we
recognize the tracking task as a prompt denoising task. Specifically, we
iteratively propagate the bounding box of each object's mask in the preceding
frame as the prompt for the next frame. Furthermore, to enhance SAM's denoising
capability against position and size variations, we propose a multi-prompt
strategy where we provide multiple jittered and scaled box prompts for each
object and preserve the mask prediction with the highest semantic similarity to
the template mask. We also introduce a point-based refinement stage to handle
occlusions and reduce cumulative errors. Without involving tracking modules,
our approach demonstrates comparable performance in video object/instance
segmentation tasks on three datasets: DAVIS2017, YouTubeVOS2018, and UVO,
serving as a concise baseline and endowing SAM-based downstream applications
with tracking capabilities.
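The propagate-jitter-select loop described in the abstract can be illustrated with a short sketch. The snippet below is not from the paper: it is a minimal approximation built on the public `segment_anything` package (SamPredictor), and the number of candidate prompts, the jitter/scale ranges, the checkpoint name, and the IoU-based candidate ranking are illustrative assumptions standing in for the paper's exact multi-prompt settings and semantic-similarity criterion.

```python
# Minimal sketch of the prompt-denoising tracking loop: propagate the previous
# mask's bounding box as a box prompt, jitter/scale it into multiple candidate
# prompts, and keep the best candidate mask. Hyper-parameters and the selection
# criterion (IoU against the previous mask) are illustrative assumptions.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry


def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Tight XYXY bounding box around a binary mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)


def jittered_boxes(box: np.ndarray, n: int = 8, shift: float = 0.05, scale: float = 0.1):
    """Multi-prompt strategy: perturb the propagated box in position and size."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    cands = [box]
    for _ in range(n - 1):
        dx, dy = np.random.uniform(-shift, shift, size=2) * np.array([w, h])
        s = 1.0 + np.random.uniform(-scale, scale)
        cx, cy = (x0 + x1) / 2 + dx, (y0 + y1) / 2 + dy
        cands.append(np.array([cx - s * w / 2, cy - s * h / 2,
                               cx + s * w / 2, cy + s * h / 2], dtype=np.float32))
    return cands


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Mask IoU, used here as a simple proxy for the paper's semantic similarity."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / (float(union) + 1e-6)


def track(frames, init_mask, checkpoint="sam_vit_h_4b8939.pth"):
    """Track one object through `frames` (HxWx3 uint8 RGB) given its first-frame mask."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    prev_mask = init_mask.astype(bool)
    results = [prev_mask]
    for frame in frames[1:]:
        predictor.set_image(frame)
        best_mask, best_score = prev_mask, -1.0
        for box in jittered_boxes(mask_to_box(prev_mask)):
            masks, _, _ = predictor.predict(box=box, multimask_output=False)
            cand = masks[0].astype(bool)
            score = iou(cand, prev_mask)  # paper: semantic similarity to a template mask
            if score > best_score:
                best_mask, best_score = cand, score
        prev_mask = best_mask
        results.append(best_mask)
    return results
```

The sketch only conveys the overall structure; in the method described above, candidate masks are ranked by semantic similarity to a template mask rather than by spatial IoU, and a point-based refinement stage additionally handles occlusions and accumulated drift.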
Related papers
- SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement [40.37217744643069]
We propose a universal and efficient approach by adapting SAM to the mask refinement task.
Specifically, we introduce a multi-prompt excavation strategy to mine diverse input prompts for SAM.
We extend our method to SAMRefiner++ by introducing an additional IoU adaptation step to further boost the performance of the generic SAMRefiner on the target dataset.
arXiv Detail & Related papers (2025-02-10T18:33:15Z)
- Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset.
We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation.
In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z)
- Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes [18.244508068200236]
Crowd-SAM is a framework designed to enhance SAM's performance in crowded and occluded scenes.
We introduce an efficient prompt sampler (EPS) and a part-whole discrimination network (PWD-Net) to enhance mask selection and accuracy in crowded scenes.
Crowd-SAM rivals state-of-the-art (SOTA) fully-supervised object detection methods on several benchmarks including CrowdHuman and CityPersons.
arXiv Detail & Related papers (2024-07-16T08:00:01Z)
- FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
arXiv Detail & Related papers (2024-05-29T02:34:13Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model [5.632631449489529]
Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities.
We propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker.
STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features.
arXiv Detail & Related papers (2023-05-22T03:03:29Z)
- Personalize Segment Anything Model with One Shot [52.54453744941516]
We propose a training-free Personalization approach for the Segment Anything Model (SAM).
Given only a single image with a reference mask, PerSAM first localizes the target concept by a location prior.
PerSAM segments it within other images or videos via three techniques: target-guided attention, target-semantic prompting, and cascaded post-refinement.
arXiv Detail & Related papers (2023-05-04T17:59:36Z)