CamoSAM2: Motion-Appearance Induced Auto-Refining Prompts for Video Camouflaged Object Detection
- URL: http://arxiv.org/abs/2504.00375v1
- Date: Tue, 01 Apr 2025 02:45:17 GMT
- Title: CamoSAM2: Motion-Appearance Induced Auto-Refining Prompts for Video Camouflaged Object Detection
- Authors: Xin Zhang, Keren Fu, Qijun Zhao
- Abstract summary: The application of SAM2 for automated segmentation in real-world scenarios faces challenges in camouflage perception and reliable prompt generation. We propose CamoSAM2, a motion-appearance prompt inducer (MAPI) and refinement framework that automatically generates and refines prompts for SAM2. Our proposed model, CamoSAM2, significantly outperforms existing state-of-the-art methods, achieving increases of 8.0% and 10.1% in the mIoU metric.
- Score: 14.219232629274186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Segment Anything Model 2 (SAM2), a prompt-guided video foundation model, has performed remarkably well in video object segmentation, drawing significant attention in the community. Because camouflaged objects are highly similar to their surroundings, which makes them difficult to distinguish even by the human eye, applying SAM2 to automated segmentation in real-world scenarios faces challenges in camouflage perception and reliable prompt generation. To address these issues, we propose CamoSAM2, a motion-appearance prompt inducer (MAPI) and refinement framework that automatically generates and refines prompts for SAM2, enabling high-quality automatic detection and segmentation in the video camouflaged object detection (VCOD) task. First, we introduce a prompt inducer that integrates motion and appearance cues simultaneously to detect camouflaged objects, delivering more accurate initial predictions than existing methods. Subsequently, we propose a video-based adaptive multi-prompts refinement (AMPR) strategy tailored for SAM2, aimed at mitigating prompt errors in the initial coarse masks and producing more reliable prompts. Specifically, we introduce a novel three-step process that generates reliable prompts via camouflaged object determination, pivotal prompting frame selection, and multi-prompts formation. Extensive experiments conducted on two benchmark datasets demonstrate that our proposed model, CamoSAM2, significantly outperforms existing state-of-the-art methods, achieving increases of 8.0% and 10.1% in the mIoU metric. Additionally, our method achieves the fastest inference speed among current VCOD models.
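As an illustration of the pipeline the abstract sketches, the following pseudocode traces one plausible control flow: MAPI yields coarse masks, AMPR determines whether an object is present, picks a pivotal frame, and forms multi-prompts, and SAM2 propagates them. This is a hypothetical sketch based only on the abstract; `mapi`, `sam2_video`, and the scoring rule are stand-ins, not the authors' implementation.

```python
import numpy as np

def camosam2_pipeline(frames, mapi, sam2_video, conf_thresh=0.5):
    """Hypothetical sketch of the CamoSAM2 control flow described in the
    abstract: MAPI produces coarse masks, AMPR turns the best one into
    prompts, and SAM2 propagates them through the video."""
    # MAPI: fuse motion and appearance cues into coarse masks,
    # one (H, W) probability map per frame.
    coarse = [mapi(f) for f in frames]

    # AMPR step 1: camouflaged object determination -- keep the clip only
    # if some frame's mask confidence suggests a real object, not noise.
    scores = [m[m > 0.5].mean() if (m > 0.5).any() else 0.0 for m in coarse]
    if max(scores) < conf_thresh:
        return None  # no camouflaged object found

    # AMPR step 2: pivotal prompting frame selection -- the frame whose
    # coarse mask is most confident anchors the prompts.
    pivot = int(np.argmax(scores))

    # AMPR step 3: multi-prompts formation -- derive a box and a positive
    # point from the pivotal mask (one simple choice of prompt types).
    ys, xs = np.where(coarse[pivot] > 0.5)
    box = (xs.min(), ys.min(), xs.max(), ys.max())
    point = (int(xs.mean()), int(ys.mean()))

    # Hand the prompts to SAM2 and propagate across the whole clip.
    return sam2_video(frames, pivot_frame=pivot, box=box, points=[point])
```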
Related papers
- CamSAM2: Segment Anything Accurately in Camouflaged Videos [37.0152845263844]
We propose Camouflaged SAM2 (CamSAM2) to handle camouflaged scenes without modifying SAM2's parameters. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets.
arXiv Detail & Related papers (2025-03-25T14:58:52Z)
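As a rough illustration of CamSAM2's explicit object-aware fusion idea, the sketch below gates high-resolution features by the current mask prediction before adding them to coarse decoder features. The module layout, shapes, and gating rule are assumptions, not the paper's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAwareFusion(nn.Module):
    """Toy stand-in for CamSAM2-style fusion: inject fine-grained,
    high-resolution features into coarse decoder features. The 'explicit'
    path gates the high-res features with the current mask prediction."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 1)  # few extra parameters

    def forward(self, coarse_feat, highres_feat, mask_logits):
        # Match spatial sizes, then gate the high-res features by the
        # predicted object region before the residual add.
        highres = F.interpolate(highres_feat, size=coarse_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        gate = torch.sigmoid(F.interpolate(mask_logits, size=coarse_feat.shape[-2:],
                                           mode="bilinear", align_corners=False))
        return coarse_feat + self.proj(highres * gate)
```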
- COMPrompter: reconceptualized segment anything model with multiprompt network for camouflaged object detection [42.23374375190698]
We propose a novel multiprompt network called COMPrompter for camouflaged object detection (COD). Our network extends the single-prompt strategy in SAM to a multiprompt strategy. We employ the discrete wavelet transform to extract high-frequency features from image embeddings.
arXiv Detail & Related papers (2024-11-28T01:58:28Z)
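COMPrompter's use of the discrete wavelet transform to pull high-frequency cues from image embeddings can be illustrated with PyWavelets; the 'haar' wavelet and the way the three detail bands are merged below are arbitrary choices for the sketch, not the paper's configuration.

```python
import numpy as np
import pywt  # PyWavelets

def highfreq_from_embedding(embed: np.ndarray) -> np.ndarray:
    """Illustrative use of a 2-D DWT to pull high-frequency structure out of
    a (C, H, W) image embedding, one transform per feature channel."""
    bands = []
    for ch in embed:
        _, (cH, cV, cD) = pywt.dwt2(ch, "haar")
        # Keep only the detail coefficients -- edges and fine texture,
        # which is where camouflage boundaries tend to live.
        bands.append(np.sqrt(cH**2 + cV**2 + cD**2))
    return np.stack(bands)  # (C, H/2, W/2) high-frequency map

demo = highfreq_from_embedding(np.random.rand(4, 64, 64))
print(demo.shape)  # (4, 32, 32)
```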
- FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
arXiv Detail & Related papers (2024-05-29T02:34:13Z)
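FocSAM's P-DyReLU is described only at a high level; the sketch below applies the generic dynamic-ReLU formulation per pixel, with activation coefficients predicted from a guidance map (e.g., features derived from the initial clicks). The parameterization is an assumption, not the paper's design.

```python
import torch
import torch.nn as nn

class PixelwiseDynamicReLU(nn.Module):
    """Sketch of a pixel-wise dynamic ReLU in the spirit of FocSAM's
    P-DyReLU: activation slopes and intercepts are predicted per pixel
    from a guidance tensor."""
    def __init__(self, channels: int, guide_channels: int):
        super().__init__()
        # Predict two (slope, intercept) pairs per channel and pixel.
        self.coef = nn.Conv2d(guide_channels, 4 * channels, 1)

    def forward(self, x, guide):
        a1, b1, a2, b2 = self.coef(guide).chunk(4, dim=1)
        # DyReLU-style max over two learned linear pieces, per pixel.
        return torch.maximum(a1 * x + b1, a2 * x + b2)

act = PixelwiseDynamicReLU(channels=8, guide_channels=3)
y = act(torch.randn(1, 8, 16, 16), torch.randn(1, 3, 16, 16))
```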
- Explicit Motion Handling and Interactive Prompting for Video Camouflaged Object Detection [23.059829327898818]
Existing video camouflaged object detection approaches take noisy motion estimation as input or model motion implicitly.
We propose a novel Explicit Motion handling and Interactive Prompting framework for VCOD, dubbed EMIP, which handles motion cues explicitly.
EMIP is characterized by a two-stream architecture for simultaneously conducting camouflaged segmentation and optical flow estimation.
arXiv Detail & Related papers (2024-03-04T12:11:07Z)
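EMIP's two-stream design can be caricatured as follows: a segmentation stream and an optical-flow stream exchange features as mutual "prompts" before their respective heads. All submodules below are placeholders standing in for real encoders and heads, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoStreamVCOD(nn.Module):
    """Schematic two-stream layout in the spirit of EMIP: one stream
    segments, the other estimates optical flow, and each stream is
    conditioned on the other's features."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.seg_enc = nn.Conv2d(3, dim, 3, padding=1)
        self.flow_enc = nn.Conv2d(6, dim, 3, padding=1)   # frame pair
        self.seg_head = nn.Conv2d(dim, 1, 1)              # camouflage mask
        self.flow_head = nn.Conv2d(dim, 2, 1)             # (u, v) flow field

    def forward(self, frame_t, frame_t1):
        seg_feat = self.seg_enc(frame_t)
        flow_feat = self.flow_enc(torch.cat([frame_t, frame_t1], dim=1))
        # Interactive prompting, reduced to its simplest form: each stream
        # sees the other's features before its head.
        return (self.seg_head(seg_feat + flow_feat),
                self.flow_head(flow_feat + seg_feat))
```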
- SAM-based instance segmentation models for the automation of structural damage detection [0.0]
We present a dataset for instance segmentation with 1,300 annotated images (640 x 640 pixels), named M1300, covering bricks, broken bricks, and cracks.
We test several leading algorithms for benchmarking, including the latest large-scale model, the prompt-based Segment Anything Model (SAM).
We propose two novel methods for automating SAM execution. The first method involves abandoning the prompt encoder and connecting the SAM encoder to other decoders, while the second method introduces a learnable self-generating prompter.
arXiv Detail & Related papers (2024-01-27T02:00:07Z)
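The second automation method, the learnable self-generating prompter, might be realized along these lines: learnable queries attend over the image embedding and stand in for the prompt encoder's sparse embeddings. Token count and dimensions are illustrative guesses, not the paper's values.

```python
import torch
import torch.nn as nn

class SelfGeneratingPrompter(nn.Module):
    """Sketch of a learnable self-generating prompter: predict sparse
    prompt embeddings directly from the image embedding, so no user
    clicks or boxes are needed."""
    def __init__(self, embed_dim: int = 256, num_tokens: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8,
                                          batch_first=True)

    def forward(self, image_embedding):                    # (B, C, H, W)
        b, c, h, w = image_embedding.shape
        tokens = image_embedding.flatten(2).transpose(1, 2)  # (B, HW, C)
        q = self.queries.expand(b, -1, -1)                   # (B, T, C)
        # Learnable queries read out object evidence from the embedding,
        # playing the role of the prompt encoder's sparse embeddings.
        prompts, _ = self.attn(q, tokens, tokens)
        return prompts                                       # (B, T, C)
```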
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
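RefSAM's lightweight cross-modal projection might look roughly like this: a small MLP maps the referring expression's text embedding into sparse (token-like) and dense (spatial) prompt embeddings for the SAM decoder. All sizes and the dense-broadcast trick are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class CrossModalMLP(nn.Module):
    """Sketch of a cross-modal projection: text embedding -> sparse and
    dense prompt embeddings for a SAM-style mask decoder."""
    def __init__(self, text_dim: int = 512, embed_dim: int = 256,
                 dense_hw: tuple = (64, 64)):
        super().__init__()
        self.dense_hw = dense_hw
        self.sparse = nn.Sequential(nn.Linear(text_dim, embed_dim),
                                    nn.GELU(),
                                    nn.Linear(embed_dim, embed_dim))
        self.dense = nn.Linear(text_dim, embed_dim)

    def forward(self, text_emb):                       # (B, text_dim)
        sparse = self.sparse(text_emb).unsqueeze(1)    # (B, 1, embed_dim)
        h, w = self.dense_hw
        # Broadcast one dense vector over the spatial grid as a crude
        # dense-prompt stand-in.
        dense = self.dense(text_emb)[:, :, None, None].expand(-1, -1, h, w)
        return sparse, dense
```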
- Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module, the SAM-guidEd refinEment Module (SEEM).
This lightweight plug-in module is specifically designed to leverage the attention mechanism to generate semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)
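A toy version of a SAM-guided refinement module could look like this: the SAM mask serves as a semantic prior that re-weights restoration features through a simple gate, in residual form so it can be dropped into an existing VSR network such as EDVR or BasicVSR. Layer choices are illustrative, not SEEM's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEEMLike(nn.Module):
    """Toy SAM-guided refinement: a SAM mask re-weights restoration
    features through an attention-style gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels + 1, channels, 3, padding=1),
            nn.Sigmoid())

    def forward(self, feat, sam_mask):     # feat: (B,C,H,W), mask: (B,1,h,w)
        mask = F.interpolate(sam_mask, size=feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        # Semantic-aware re-weighting; the residual form keeps the plug-in
        # safe to insert into a pretrained network.
        return feat + feat * self.gate(torch.cat([feat, mask], dim=1))
```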
- CamoFormer: Masked Separable Attention for Camouflaged Object Detection [94.2870722866853]
We present a simple masked separable attention (MSA) for camouflaged object detection.
We first separate the multi-head self-attention into three parts, which are responsible for distinguishing the camouflaged objects from the background using different mask strategies.
We propose to capture high-resolution semantic representations progressively based on a simple top-down decoder with the proposed MSA to attain precise segmentation results.
arXiv Detail & Related papers (2022-12-10T10:03:27Z)
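The head-splitting idea behind masked separable attention can be sketched directly: one group of heads attends within the (coarse) foreground, one within the background, and the rest attend globally. The three-way split and the masking details below are simplified guesses at the scheme the abstract describes.

```python
import torch

def masked_separable_attention(q, k, v, mask):
    """Sketch of masked separable attention: heads split into foreground,
    background, and global groups, driven by a coarse mask.
    q, k, v: (B, heads, N, d); mask: (B, N) in [0, 1] over flattened pixels."""
    g = q.shape[1] // 3
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, heads, N, N)
    neg = torch.finfo(logits.dtype).min
    fg = mask[:, None, None, :] < 0.5    # keys outside the foreground
    bg = mask[:, None, None, :] >= 0.5   # keys outside the background
    logits[:, :g] = logits[:, :g].masked_fill(fg, neg)        # foreground heads
    logits[:, g:2*g] = logits[:, g:2*g].masked_fill(bg, neg)  # background heads
    # Remaining heads: plain, unmasked attention over the whole image.
    return logits.softmax(-1) @ v
```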
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.