Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
- URL: http://arxiv.org/abs/2411.09219v1
- Date: Thu, 14 Nov 2024 06:31:20 GMT
- Title: Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
- Authors: Yuheng Shi, Minjing Dong, Chang Xu
- Abstract summary: We introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation.
Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.
- Score: 26.786890883280062
- License:
- Abstract: While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. While previous adaptations addressed the spatial-invariant semantic features by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. Besides, we propose a refinement strategy for CLIP's coarse segmentation outputs by transforming them into prompts for SAM, further enhancing the segmentation performance. Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.
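As a concrete reading of the splice-then-segment paradigm described above, here is a minimal, hypothetical sketch of the data flow: extract features per sub-image, splice them into one map, aggregate globally with a SAM-derived affinity, then match against CLIP text embeddings. The window size, feature stride, and the callables standing in for CLIP/DINO/SAM are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
# Hypothetical sketch of a "splice-then-segment" pipeline in the spirit of the
# Trident abstract. Window size, feature stride, and the way the SAM affinity
# is applied are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def splice_then_segment(image, local_feats_fn, sam_affinity_fn, text_embeds,
                        window=224):
    """image: (1, 3, H, W); H and W assumed divisible by `window`.
    local_feats_fn: maps a (1, 3, window, window) crop to (1, D, h, w) features
                    (standing in for the spliced CLIP+DINO features).
    sam_affinity_fn: maps (image, N) to an (N, N) token affinity matrix.
    text_embeds: (K, D) class embeddings from a CLIP-like text encoder.
    """
    _, _, H, W = image.shape
    rows = []
    for top in range(0, H, window):                    # 1) slide over sub-images
        cols = []
        for left in range(0, W, window):
            crop = image[:, :, top:top + window, left:left + window]
            cols.append(local_feats_fn(crop))          # per-window local features
        rows.append(torch.cat(cols, dim=-1))
    feats = torch.cat(rows, dim=-2)                    # 2) splice into one map
    B, D, h, w = feats.shape
    tokens = feats.flatten(2).transpose(1, 2)          # (1, N, D), N = h*w
    affinity = sam_affinity_fn(image, tokens.shape[1]) # (N, N) global affinity
    tokens = torch.softmax(affinity, dim=-1) @ tokens[0]   # 3) global aggregation
    logits = F.normalize(tokens, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    return logits.view(h, w, -1).argmax(-1)            # coarse segmentation map

# Toy usage with dummy stand-ins for the CLIP/DINO/SAM components.
if __name__ == "__main__":
    D, K = 64, 5
    img = torch.randn(1, 3, 448, 448)
    local = lambda crop: torch.randn(1, D, crop.shape[-2] // 16, crop.shape[-1] // 16)
    aff = lambda im, n: torch.randn(n, n)
    txt = torch.randn(K, D)
    print(splice_then_segment(img, local, aff, txt).shape)  # torch.Size([28, 28])
```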
Related papers
- Effective SAM Combination for Open-Vocabulary Semantic Segmentation [24.126307031048203]
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes.
ESC-Net is a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation.
ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context.
arXiv Detail & Related papers (2024-11-22T04:36:12Z)
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
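As a rough illustration of the decomposition described above, the toy block below returns only the attention output of a final transformer layer and drops the residual path identified as the noise source (the MLP branch is omitted for brevity). This is a schematic reconstruction from the two-line summary, not ClearCLIP's actual code; the module name and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class LastBlockNoResidual(nn.Module):
    """Toy final transformer block that returns only the attention output,
    discarding the residual stream (the noise source named above) and,
    for simplicity, the MLP branch as well."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, dim) patch tokens
        q = k = v = self.norm(x)
        out, _ = self.attn(q, k, v)
        return out                             # note: no `x + out` residual add

tokens = torch.randn(2, 197, 768)              # e.g. a ViT-B/16 token sequence
print(LastBlockNoResidual()(tokens).shape)     # torch.Size([2, 197, 768])
```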
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
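A minimal sketch of the frozen-backbone setup described above, assuming a generic feature extractor in place of CLIP: the backbone's weights are frozen and only a small decoder head receives gradients. WeCLIP's actual decoder design is not reproduced here.

```python
import torch
import torch.nn as nn

class FrozenBackboneSegmenter(nn.Module):
    """Frozen feature extractor + trainable decoder head (schematic only)."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():        # freeze the CLIP-like backbone
            p.requires_grad_(False)
        self.decoder = nn.Sequential(               # only these weights are trained
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, x):
        with torch.no_grad():                       # no gradients through the backbone
            feats = self.backbone(x)                # (B, feat_dim, h, w)
        return self.decoder(feats)                  # (B, num_classes, h, w) logits

# Stand-in backbone: any module mapping images to a feature map works here.
backbone = nn.Conv2d(3, 512, kernel_size=16, stride=16)
model = FrozenBackboneSegmenter(backbone, feat_dim=512, num_classes=21)
print(model(torch.randn(1, 3, 224, 224)).shape)     # torch.Size([1, 21, 14, 14])
```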
arXiv Detail & Related papers (2024-06-17T03:49:47Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- Multi-Scale Semantic Segmentation with Modified MBConv Blocks [29.026787888644474]
We introduce a novel adaptation of MBConv blocks specifically tailored for semantic segmentation.
By implementing these changes, our approach achieves impressive mean Intersection over Union (mIoU) scores of 84.5% and 84.0% on the Cityscapes test and validation datasets, respectively.
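For reference, a baseline (unmodified) MBConv inverted-residual block is sketched below; the summary does not describe the paper's specific modifications, so only the standard MobileNetV2/EfficientNet-style block is shown, with illustrative layer sizes.

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Baseline MBConv block: 1x1 expand -> depthwise 3x3 -> 1x1 project,
    with an identity residual when input and output shapes match."""
    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)   # residual add (shapes match by construction)

print(MBConv(64)(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```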
arXiv Detail & Related papers (2024-02-07T07:01:08Z)
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation masks, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
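The three-stage pipeline above (noun extraction, class-agnostic mask proposal, fusion-based association) can be summarized schematically as below; every callable is a hypothetical stand-in, since the summary does not name the models used at each stage.

```python
from typing import Callable, Dict, List

def ground_caption(image, caption: str,
                   extract_nouns: Callable[[str], List[str]],        # LLM stage
                   propose_masks: Callable[[object], List[object]],  # class-agnostic masks
                   match_score: Callable[[object, object, str], float]) -> Dict[str, object]:
    """Associate each semantic noun in a long caption with one proposed mask.
    All three callables are stand-ins for the models named in the summary."""
    nouns = extract_nouns(caption)            # 1) semantic nouns from the LLM
    masks = propose_masks(image)              # 2) entity-level mask proposals
    grounding = {}
    for noun in nouns:                        # 3) fuse features and pick the best mask
        scores = [match_score(image, m, noun) for m in masks]
        grounding[noun] = masks[max(range(len(masks)), key=scores.__getitem__)]
    return grounding

# Toy usage with trivial stand-ins.
demo = ground_caption(
    image=None,
    caption="a dog chasing a ball on the grass",
    extract_nouns=lambda c: ["dog", "ball", "grass"],
    propose_masks=lambda img: ["mask_0", "mask_1", "mask_2"],
    match_score=lambda img, m, n: hash((m, n)) % 100,
)
print(demo)
```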
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
- SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation [27.367986520072147]
Unsupervised semantic segmentation is a challenging task that segments images into semantic groups without manual annotation.
We propose a novel approach called SmooSeg that harnesses self-supervised learning methods to model the closeness relationships among observations as smoothness signals.
Our SmooSeg significantly outperforms STEGO in terms of pixel accuracy on three datasets.
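As a generic illustration of a smoothness prior on dense predictions, the loss below encourages neighbouring locations to receive similar soft assignments; SmooSeg's actual smoothness term and its coupling with self-supervised features are not detailed in this summary, so this formulation is only an assumption.

```python
import torch
import torch.nn.functional as F

def smoothness_loss(logits: torch.Tensor) -> torch.Tensor:
    """Generic pairwise smoothness prior: penalize disagreement between the
    soft assignments of horizontally and vertically adjacent locations.
    logits: (B, K, H, W) per-pixel cluster/class scores."""
    p = F.softmax(logits, dim=1)
    dx = (p[:, :, :, 1:] - p[:, :, :, :-1]).abs().mean()   # horizontal neighbours
    dy = (p[:, :, 1:, :] - p[:, :, :-1, :]).abs().mean()   # vertical neighbours
    return dx + dy

logits = torch.randn(2, 27, 40, 40, requires_grad=True)
loss = smoothness_loss(logits)
loss.backward()                  # differentiable, so it can serve as a prior term
print(float(loss))
```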
arXiv Detail & Related papers (2023-10-27T03:29:25Z)
- [CLS] Token is All You Need for Zero-Shot Semantic Segmentation [60.06653755695356]
We propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP.
Specifically, we use the [CLS] token output from the text branch, as an auxiliary semantic prompt, to replace the navigation [CLS] token in shallow layers of the ViT-based visual encoder.
Our proposed ZS3 method achieves SOTA performance, and it is even comparable with few-shot semantic segmentation methods.
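The token-replacement mechanism above can be sketched roughly as follows: the visual [CLS] token is overwritten with a projected text [CLS] embedding in the first few (shallow) layers of a toy ViT trunk. The number of shallow layers and the projection are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    """Toy ViT trunk where the visual [CLS] token is overwritten with a
    (projected) text [CLS] embedding in the first `shallow` layers only."""
    def __init__(self, dim=768, depth=12, shallow=3, text_dim=512):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
             for _ in range(depth)])
        self.proj = nn.Linear(text_dim, dim)   # map text width to visual width
        self.shallow = shallow

    def forward(self, tokens, text_cls):       # tokens: (B, 1+N, dim); text_cls: (B, text_dim)
        prompt = self.proj(text_cls)
        for i, blk in enumerate(self.blocks):
            if i < self.shallow:               # swap the [CLS] slot in shallow layers
                tokens = torch.cat([prompt.unsqueeze(1), tokens[:, 1:]], dim=1)
            tokens = blk(tokens)
        return tokens

vit = PromptedViT()
out = vit(torch.randn(2, 197, 768), torch.randn(2, 512))
print(out.shape)                               # torch.Size([2, 197, 768])
```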
arXiv Detail & Related papers (2023-04-13T01:35:07Z)
- Hierarchical Dense Correlation Distillation for Few-Shot Segmentation [46.696051965252934]
Few-shot semantic segmentation (FSS) aims to form class-agnostic models segmenting unseen classes with only a handful of annotations.
We design a Hierarchically Decoupled Matching Network (HDMNet) that mines pixel-level support correlations on top of a transformer architecture.
We propose a matching module to reduce train-set overfitting and introduce correlation distillation that leverages semantic correspondence from coarse resolution to boost fine-grained segmentation.
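The correlation-distillation idea, coarse-resolution semantic correspondence supervising finer levels, could be written generically as below; the KL-based loss and the assumption that the coarse correlation has already been resized to the fine token grid are illustrative choices, not HDMNet's actual formulation.

```python
import torch
import torch.nn.functional as F

def correlation(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """Pixel-level query-support correlation: cosine similarity between every
    query token and every support token. Inputs: (B, N, D) token features."""
    q = F.normalize(query, dim=-1)
    s = F.normalize(support, dim=-1)
    return q @ s.transpose(1, 2)                      # (B, Nq, Ns)

def correlation_distillation(fine_corr, coarse_corr, tau: float = 1.0):
    """Illustrative distillation: the coarse correlation (detached teacher)
    supervises the fine correlation; both are assumed to share one token grid."""
    teacher = F.softmax(coarse_corr.detach() / tau, dim=-1)
    student = F.log_softmax(fine_corr / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

q_fine, s_fine = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
q_coarse, s_coarse = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
loss = correlation_distillation(correlation(q_fine, s_fine),
                                correlation(q_coarse, s_coarse))
print(float(loss))
```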
arXiv Detail & Related papers (2023-03-26T08:13:12Z)
- Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision [49.905448429974804]
We consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories.
We propose a transformer-based model for OVS, termed as OVSegmentor, which exploits web-crawled image-text pairs for pre-training.
Our model achieves superior segmentation results over the state-of-the-art method by using only 3% of the data (4M vs. 134M) for pre-training.
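The summary does not spell out OVSegmentor's proxy objectives, so the block below only sketches the generic image-text contrastive alignment that web-crawled pairs make possible; treat it as illustrative background rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive(img_emb, txt_emb, temperature: float = 0.07):
    """Generic InfoNCE alignment between image and caption embeddings, the
    kind of supervision web-crawled image-text pairs provide.
    img_emb, txt_emb: (B, D), paired along the batch dimension."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(img.size(0))                # i-th image matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

print(float(image_text_contrastive(torch.randn(8, 256), torch.randn(8, 256))))
```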
arXiv Detail & Related papers (2023-01-22T13:10:05Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
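A minimal sketch of what "free dense labels" from a CLIP-like model can look like: score each patch feature against class-name text embeddings and take the per-patch argmax as a pseudo-label. DenseCLIP's specific architectural tweaks for extracting useful patch features are not reproduced here, and the threshold is a hypothetical addition.

```python
import torch
import torch.nn.functional as F

def dense_pseudo_labels(patch_feats, text_embeds, threshold: float = 0.0):
    """Turn CLIP-style features into dense labels: score every patch against
    every class-name embedding and take the per-patch argmax.
    patch_feats: (B, D, h, w); text_embeds: (K, D) from the text encoder."""
    B, D, h, w = patch_feats.shape
    patches = F.normalize(patch_feats.flatten(2).transpose(1, 2), dim=-1)  # (B, N, D)
    classes = F.normalize(text_embeds, dim=-1)                             # (K, D)
    scores = patches @ classes.T                                           # (B, N, K)
    labels = scores.argmax(-1)                                             # (B, N)
    labels[scores.max(-1).values < threshold] = -1   # mark low-confidence pixels unlabeled
    return labels.view(B, h, w)

print(dense_pseudo_labels(torch.randn(1, 512, 14, 14), torch.randn(21, 512)).shape)
```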
arXiv Detail & Related papers (2021-12-02T09:23:01Z)