ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation
- URL: http://arxiv.org/abs/2401.12665v2
- Date: Mon, 29 Jan 2024 10:57:38 GMT
- Title: ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation
- Authors: Shengze Li, Jianjian Cao, Peng Ye, Yuhan Ding, Chongjun Tu, Tao Chen
- Abstract summary: We propose a CLIP and SAM collaboration framework called ClipSAM for ZSAS.
The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation.
In detail, we introduce a Unified Multi-scale Cross-modal Interaction (UMCI) module that interacts language with visual features at multiple scales of CLIP.
- Score: 5.376142948115328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, foundational models such as CLIP and SAM have shown promising
performance for the task of Zero-Shot Anomaly Segmentation (ZSAS). However,
either CLIP-based or SAM-based ZSAS methods still suffer from non-negligible
key drawbacks: 1) CLIP primarily focuses on global feature alignment across
different inputs, leading to imprecise segmentation of local anomalous parts;
2) SAM tends to generate numerous redundant masks without proper prompt
constraints, resulting in complex post-processing requirements. In this work,
we innovatively propose a CLIP and SAM collaboration framework called ClipSAM
for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding
capability for anomaly localization and rough segmentation, which is further
used as the prompt constraints for SAM to refine the anomaly segmentation
results. In detail, we introduce a crucial Unified Multi-scale Cross-modal
Interaction (UMCI) module that interacts language with visual features at
multiple scales of CLIP to reason about anomaly positions. Then, we design a novel
Multi-level Mask Refinement (MMR) module, which utilizes the positional
information as multi-level prompts for SAM to acquire masks at hierarchical
levels and then merges them. Extensive experiments validate the effectiveness of
our approach, which achieves optimal segmentation performance on the MVTec-AD and
VisA datasets.
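
The abstract describes a two-stage collaboration: CLIP (via UMCI) produces a coarse anomaly localization, which is converted into point and box prompts that constrain SAM, and the resulting multi-level masks are merged (MMR). Below is a minimal sketch of that prompting flow under stated assumptions: the CLIP scoring and SAM refinement steps are placeholders (`heatmap` is a stand-in array and `sam_refine` is hypothetical), and the prompt-extraction and mask-merging logic is an illustrative guess, not the authors' exact UMCI/MMR implementation.

```python
import numpy as np
from scipy import ndimage

def heatmap_to_prompts(heatmap, thresh=0.5):
    """Turn a coarse anomaly heatmap (e.g. from a CLIP-based scorer) into point and box prompts."""
    binary = heatmap >= thresh
    labeled, num_regions = ndimage.label(binary)  # connected anomalous regions
    points, boxes = [], []
    for region_id in range(1, num_regions + 1):
        ys, xs = np.nonzero(labeled == region_id)
        points.append((int(xs.mean()), int(ys.mean())))  # region centre (x, y)
        boxes.append((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))  # tight box
    return points, boxes

def merge_masks(mask_levels, vote=0.5):
    """Merge hierarchical binary masks (e.g. per-point and per-box outputs) by pixel voting."""
    stack = np.stack([m.astype(np.float32) for m in mask_levels])
    return stack.mean(axis=0) >= vote

# Stand-in usage: a real pipeline would obtain `heatmap` from CLIP and call SAM
# once per prompt; `sam_refine` below is hypothetical and left commented out.
heatmap = np.zeros((64, 64), dtype=np.float32)
heatmap[20:30, 40:52] = 0.9  # fake anomalous region
points, boxes = heatmap_to_prompts(heatmap)
# masks = [sam_refine(image, point=p) for p in points] + [sam_refine(image, box=b) for b in boxes]
masks = [heatmap > 0.5, heatmap > 0.7]  # stand-in masks for the demo
final_mask = merge_masks(masks)
print(points, boxes, int(final_mask.sum()))
```

Deriving both point and box prompts from the same coarse localization mirrors the paper's idea of multi-level prompts; the voting-based merge is only one simple way to combine the resulting masks.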
Related papers
- Adapting Segment Anything Model for Unseen Object Instance Segmentation [70.60171342436092]
Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments.
We propose UOIS-SAM, a data-efficient solution for the UOIS task.
UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder.
arXiv Detail & Related papers (2024-09-23T19:05:50Z)
- SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation [88.80792308991867]
The Segment Anything Model (SAM) has shown the ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges.
This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation.
Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains.
arXiv Detail & Related papers (2024-07-23T17:47:25Z)
- AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning [61.666973416903005]
Segment Anything Model (SAM) has demonstrated its impressive generalization capabilities in open-world scenarios with the guidance of prompts.
We propose a novel framework, termed AlignSAM, designed for automatic prompting that aligns SAM to an open context.
arXiv Detail & Related papers (2024-06-01T16:21:39Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- WSI-SAM: Multi-resolution Segment Anything Model (SAM) for histopathology whole-slide images [8.179859593451285]
We present WSI-SAM, enhancing Segment Anything Model (SAM) with precise object segmentation capabilities for histopathology images.
To fully exploit pretrained knowledge while minimizing training overhead, we keep SAM frozen, introducing only minimal extra parameters.
Our model outperforms SAM by 4.1 and 2.5 percentage points on a ductal carcinoma in situ (DCIS) segmentation task and a breast cancer metastasis segmentation task, respectively.
arXiv Detail & Related papers (2024-03-14T10:30:43Z)
- BLO-SAM: Bi-level Optimization Based Overfitting-Preventing Finetuning of SAM [37.1263294647351]
We introduce BLO-SAM, which finetunes the Segment Anything Model (SAM) based on bi-level optimization (BLO).
BLO-SAM reduces the risk of overfitting by training the model's weight parameters and the prompt embedding on two separate subsets of the training dataset.
Results demonstrate BLO-SAM's superior performance over various state-of-the-art image semantic segmentation methods.
arXiv Detail & Related papers (2024-02-26T06:36:32Z)
- Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively [69.97238935096094]
The Open-Vocabulary SAM is a SAM-inspired model designed for simultaneous interactive segmentation and recognition.
Our method can segment and recognize approximately 22,000 classes.
arXiv Detail & Related papers (2024-01-05T18:59:22Z)
- Stable Segment Anything Model [79.9005670886038]
The Segment Anything Model (SAM) achieves remarkable promptable segmentation given high-quality prompts.
This paper presents the first comprehensive analysis on SAM's segmentation stability across a diverse spectrum of prompt qualities.
Our solution, termed Stable-SAM, offers several advantages: 1) it improves SAM's segmentation stability across a wide range of prompt qualities, while 2) retaining SAM's powerful promptable segmentation efficiency and generality.
arXiv Detail & Related papers (2023-11-27T12:51:42Z)
- SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding [40.40630116715132]
The landscape of publicly available vision foundation models (VFMs) is expanding rapidly.
We introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise.
By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer.
arXiv Detail & Related papers (2023-10-23T19:21:57Z)