Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt
for Segmenting Camouflaged Objects
- URL: http://arxiv.org/abs/2312.07374v3
- Date: Mon, 18 Dec 2023 20:17:55 GMT
- Title: Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt
for Segmenting Camouflaged Objects
- Authors: Jian Hu, Jiayi Lin, Weitong Cai, Shaogang Gong
- Abstract summary: We introduce a per-instance test-time adaptation mechanism called Generalizable SAM (GenSAM) to automatically generate and optimize visual prompts.
Experiments on three benchmarks demonstrate that GenSAM outperforms point supervision approaches.
- Score: 32.14438610147615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Camouflaged object detection (COD) approaches heavily rely on pixel-level
annotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse
annotations like scribbles or points to reduce annotation effort, but this can
lead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable
segmentation ability with sparse prompts such as points. However, manually
providing prompts is not always feasible, as they may not be available in
real-world applications. Additionally, point prompts provide only localization
information rather than semantics, which can intrinsically cause ambiguity in
interpreting the targets. In this work, we aim to eliminate the need for manual
prompts. The key idea is to
employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts
using the semantic information given by a generic text prompt. To that end, we
introduce a per-instance test-time adaptation mechanism called Generalizable
SAM (GenSAM) to automatically generate and optimize visual prompts from the
generic task prompt for WSCOD. In particular, CCTP maps a single generic text prompt
onto image-specific consensus foreground and background heatmaps using
vision-language models, acquiring reliable visual prompts. Moreover, to
test-time adapt the visual prompts, we further propose Progressive Mask
Generation (PMG) to iteratively reweight the input image, guiding the model to
focus on the targets in a coarse-to-fine manner. Crucially, all network
parameters are fixed, avoiding the need for additional training. Experiments on
three benchmarks demonstrate that GenSAM outperforms point-supervision
approaches and achieves results comparable to scribble-supervision ones,
relying solely on general task descriptions as prompts. Our code is available
at: https://lwpyh.github.io/GenSAM/.
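To make the pipeline concrete, below is a minimal Python sketch of a GenSAM-style test-time loop, written under stated assumptions rather than from the authors' code: the `cctp_heatmaps` callable is a hypothetical stand-in for the CCTP step (reasoning consensus foreground/background heatmaps from the generic text prompt with vision-language models), the checkpoint path is illustrative, and the SAM calls follow the public `segment_anything` package.

```python
# Minimal sketch of a GenSAM-style test-time loop (illustrative, not the authors' code).
# Assumptions: `cctp_heatmaps` is a user-supplied callable standing in for CCTP, i.e.
# it maps (image, task_prompt) to foreground/background heatmaps in [0, 1]; the SAM
# calls follow the public `segment_anything` API; the checkpoint path is illustrative.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor


def gensam_style_segment(image, cctp_heatmaps, task_prompt="the camouflaged animal",
                         checkpoint="sam_vit_h_4b8939.pth", iters=3):
    """image: HxWx3 uint8 RGB array; returns a boolean HxW mask."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)  # weights stay frozen
    predictor = SamPredictor(sam)
    reweighted = image.astype(np.float32)
    mask = None
    for _ in range(iters):
        # CCTP step (placeholder): consensus foreground/background heatmaps
        # reasoned from the single generic task prompt.
        fg, bg = cctp_heatmaps(reweighted.astype(np.uint8), task_prompt)
        # Heatmap peaks become positive/negative point prompts for SAM.
        fg_xy = np.unravel_index(fg.argmax(), fg.shape)[::-1]  # (x, y)
        bg_xy = np.unravel_index(bg.argmax(), bg.shape)[::-1]
        predictor.set_image(reweighted.astype(np.uint8))
        masks, scores, _ = predictor.predict(
            point_coords=np.array([fg_xy, bg_xy], dtype=np.float32),
            point_labels=np.array([1, 0]),
            multimask_output=True,
        )
        mask = masks[scores.argmax()]
        # PMG-style reweighting: emphasize the current mask region so the next
        # iteration attends to the target coarse-to-fine; no parameters are updated.
        weight = 0.5 + 0.5 * mask[..., None].astype(np.float32)
        reweighted = image.astype(np.float32) * weight
    return mask
```

In this sketch, all adaptation happens through the prompts and the reweighted input; neither SAM nor the vision-language model is fine-tuned, mirroring the training-free claim in the abstract.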
Related papers
- Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend SAM to few-shot semantic segmentation (FSS).
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z)
- PASS: Test-Time Prompting to Adapt Styles and Semantic Shapes in Medical Image Segmentation [25.419843931497965]
Test-time adaptation (TTA) has emerged as a promising paradigm to handle the domain shifts at test time for medical images.
We propose PASS (Prompting to Adapt Styles and Semantic shapes), which jointly learns two types of prompts.
We demonstrate the superior performance of PASS over state-of-the-art methods on multiple medical image segmentation datasets.
arXiv Detail & Related papers (2024-10-02T14:11:26Z)
- Automating MedSAM by Learning Prompts with Weak Few-Shot Supervision [10.609371657347806]
This work proposes to replace conditioning on input prompts with a lightweight module that directly learns a prompt embedding from the image embedding.
Our approach is validated on MedSAM, a version of SAM fine-tuned for medical images.
arXiv Detail & Related papers (2024-09-30T13:53:01Z)
- PointSAM: Pointly-Supervised Segment Anything Model for Remote Sensing Images [16.662173255725463]
We propose a novel Pointly-supervised Segment Anything Model named PointSAM.
We conduct experiments on RSI datasets, including WHU, HRSID, and NWPU VHR-10.
The results show that our method significantly outperforms direct testing with SAM, SAM2, and other comparison methods.
arXiv Detail & Related papers (2024-09-20T11:02:18Z)
- When 3D Partial Points Meets SAM: Tooth Point Cloud Segmentation with Sparse Labels [39.54551717450374]
Tooth point cloud segmentation is a fundamental task in many orthodontic applications.
Recent weakly-supervised alternatives use weak labels for 3D segmentation and achieve promising results.
We propose a framework named SAMTooth that leverages SAM's capacity to complement the extremely sparse supervision.
arXiv Detail & Related papers (2024-09-03T08:14:56Z)
- Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation [74.04806143723597]
We introduce an iterative Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a mask generator.
The prompt generator uses a multi-scale chain of thought prompting, initially exploring hallucinations for extracting extended contextual knowledge on a test image.
The generated masks iteratively induce the prompt generator to focus more on task-relevant image areas and reduce irrelevant hallucinations, resulting jointly in better prompts and masks.
arXiv Detail & Related papers (2024-08-27T17:06:22Z)
- AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning [61.666973416903005]
Segment Anything Model (SAM) has demonstrated its impressive generalization capabilities in open-world scenarios with the guidance of prompts.
We propose a novel framework, termed AlignSAM, designed for automatic prompting for aligning SAM to an open context.
arXiv Detail & Related papers (2024-06-01T16:21:39Z)
- Visual In-Context Prompting [100.93587329049848]
In this paper, we introduce a universal visual in-context prompting framework for vision tasks such as open-set segmentation and detection.
We build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points.
Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities.
arXiv Detail & Related papers (2023-11-22T18:59:48Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)