Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
- URL: http://arxiv.org/abs/2411.09219v1
- Date: Thu, 14 Nov 2024 06:31:20 GMT
- Title: Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
- Authors: Yuheng Shi, Minjing Dong, Chang Xu
- Abstract summary: We introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation.
Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.
- Score: 26.786890883280062
- License:
- Abstract: While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. While previous adaptations addressed the spatial-invariant semantic features by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. Besides, we propose a refinement strategy for CLIP's coarse segmentation outputs by transforming them into prompts for SAM, further enhancing the segmentation performance. Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.
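As a concrete reading of the splice-then-segment paradigm described above, here is a minimal, hypothetical sketch of the data flow: extract features per sub-image, splice them into one map, aggregate globally with a SAM-derived affinity, then match against CLIP text embeddings. The window size, feature stride, and the callables standing in for CLIP/DINO/SAM are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
# Hypothetical sketch of a "splice-then-segment" pipeline in the spirit of the
# Trident abstract. Window size, feature stride, and the way the SAM affinity
# is applied are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def splice_then_segment(image, local_feats_fn, sam_affinity_fn, text_embeds,
                        window=224):
    """image: (1, 3, H, W); H and W assumed divisible by `window`.
    local_feats_fn: maps a (1, 3, window, window) crop to (1, D, h, w) features
                    (standing in for the spliced CLIP+DINO features).
    sam_affinity_fn: maps (image, N) to an (N, N) token affinity matrix.
    text_embeds: (K, D) class embeddings from a CLIP-like text encoder.
    """
    _, _, H, W = image.shape
    rows = []
    for top in range(0, H, window):                    # 1) slide over sub-images
        cols = []
        for left in range(0, W, window):
            crop = image[:, :, top:top + window, left:left + window]
            cols.append(local_feats_fn(crop))          # per-window local features
        rows.append(torch.cat(cols, dim=-1))
    feats = torch.cat(rows, dim=-2)                    # 2) splice into one map
    B, D, h, w = feats.shape
    tokens = feats.flatten(2).transpose(1, 2)          # (1, N, D), N = h*w
    affinity = sam_affinity_fn(image, tokens.shape[1]) # (N, N) global affinity
    tokens = torch.softmax(affinity, dim=-1) @ tokens[0]   # 3) global aggregation
    logits = F.normalize(tokens, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    return logits.view(h, w, -1).argmax(-1)            # coarse segmentation map

# Toy usage with dummy stand-ins for the CLIP/DINO/SAM components.
if __name__ == "__main__":
    D, K = 64, 5
    img = torch.randn(1, 3, 448, 448)
    local = lambda crop: torch.randn(1, D, crop.shape[-2] // 16, crop.shape[-1] // 16)
    aff = lambda im, n: torch.randn(n, n)
    txt = torch.randn(K, D)
    print(splice_then_segment(img, local, aff, txt).shape)  # torch.Size([28, 28])
```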
Related papers
- Effective SAM Combination for Open-Vocabulary Semantic Segmentation [24.126307031048203]
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes.
ESC-Net is a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation.
ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context.
arXiv Detail & Related papers (2024-11-22T04:36:12Z)
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
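As a rough illustration of the decomposition described above, the toy block below returns only the attention output of a final transformer layer and drops the residual path identified as the noise source (the MLP branch is omitted for brevity). This is a schematic reconstruction from the two-line summary, not ClearCLIP's actual code; the module name and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class LastBlockNoResidual(nn.Module):
    """Toy final transformer block that returns only the attention output,
    discarding the residual stream (the noise source named above) and,
    for simplicity, the MLP branch as well."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, dim) patch tokens
        q = k = v = self.norm(x)
        out, _ = self.attn(q, k, v)
        return out                             # note: no `x + out` residual add

tokens = torch.randn(2, 197, 768)              # e.g. a ViT-B/16 token sequence
print(LastBlockNoResidual()(tokens).shape)     # torch.Size([2, 197, 768])
```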
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
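A minimal sketch of the frozen-backbone setup described above, assuming a generic feature extractor in place of CLIP: the backbone's weights are frozen and only a small decoder head receives gradients. WeCLIP's actual decoder design is not reproduced here.

```python
import torch
import torch.nn as nn

class FrozenBackboneSegmenter(nn.Module):
    """Frozen feature extractor + trainable decoder head (schematic only)."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():        # freeze the CLIP-like backbone
            p.requires_grad_(False)
        self.decoder = nn.Sequential(               # only these weights are trained
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, x):
        with torch.no_grad():                       # no gradients through the backbone
            feats = self.backbone(x)                # (B, feat_dim, h, w)
        return self.decoder(feats)                  # (B, num_classes, h, w) logits

# Stand-in backbone: any module mapping images to a feature map works here.
backbone = nn.Conv2d(3, 512, kernel_size=16, stride=16)
model = FrozenBackboneSegmenter(backbone, feat_dim=512, num_classes=21)
print(model(torch.randn(1, 3, 224, 224)).shape)     # torch.Size([1, 21, 14, 14])
```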
arXiv Detail & Related papers (2024-06-17T03:49:47Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- Multi-Scale Semantic Segmentation with Modified MBConv Blocks [29.026787888644474]
We introduce a novel adaptation of MBConv blocks specifically tailored for semantic segmentation.
By implementing these changes, our approach achieves impressive mean Intersection over Union (mIoU) scores of 84.5% and 84.0% on the Cityscapes test and validation datasets, respectively.
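For reference, a baseline (unmodified) MBConv inverted-residual block is sketched below; the summary does not describe the paper's specific modifications, so only the standard MobileNetV2/EfficientNet-style block is shown, with illustrative layer sizes.

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Baseline MBConv block: 1x1 expand -> depthwise 3x3 -> 1x1 project,
    with an identity residual when input and output shapes match."""
    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)   # residual add (shapes match by construction)

print(MBConv(64)(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```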
arXiv Detail & Related papers (2024-02-07T07:01:08Z)
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation masks, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
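The three-stage pipeline above (noun extraction, class-agnostic mask proposal, fusion-based association) can be summarized schematically as below; every callable is a hypothetical stand-in, since the summary does not name the models used at each stage.

```python
from typing import Callable, Dict, List

def ground_caption(image, caption: str,
                   extract_nouns: Callable[[str], List[str]],        # LLM stage
                   propose_masks: Callable[[object], List[object]],  # class-agnostic masks
                   match_score: Callable[[object, object, str], float]) -> Dict[str, object]:
    """Associate each semantic noun in a long caption with one proposed mask.
    All three callables are stand-ins for the models named in the summary."""
    nouns = extract_nouns(caption)            # 1) semantic nouns from the LLM
    masks = propose_masks(image)              # 2) entity-level mask proposals
    grounding = {}
    for noun in nouns:                        # 3) fuse features and pick the best mask
        scores = [match_score(image, m, noun) for m in masks]
        grounding[noun] = masks[max(range(len(masks)), key=scores.__getitem__)]
    return grounding

# Toy usage with trivial stand-ins.
demo = ground_caption(
    image=None,
    caption="a dog chasing a ball on the grass",
    extract_nouns=lambda c: ["dog", "ball", "grass"],
    propose_masks=lambda img: ["mask_0", "mask_1", "mask_2"],
    match_score=lambda img, m, n: hash((m, n)) % 100,
)
print(demo)
```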
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
- SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation [27.367986520072147]
Unsupervised semantic segmentation is a challenging task that segments images into semantic groups without manual annotation.
We propose a novel approach called SmooSeg that harnesses self-supervised learning methods to model the closeness relationships among observations as smoothness signals.
Our SmooSeg significantly outperforms STEGO in terms of pixel accuracy on three datasets.
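As a generic illustration of a smoothness prior on dense predictions, the loss below encourages neighbouring locations to receive similar soft assignments; SmooSeg's actual smoothness term and its coupling with self-supervised features are not detailed in this summary, so this formulation is only an assumption.

```python
import torch
import torch.nn.functional as F

def smoothness_loss(logits: torch.Tensor) -> torch.Tensor:
    """Generic pairwise smoothness prior: penalize disagreement between the
    soft assignments of horizontally and vertically adjacent locations.
    logits: (B, K, H, W) per-pixel cluster/class scores."""
    p = F.softmax(logits, dim=1)
    dx = (p[:, :, :, 1:] - p[:, :, :, :-1]).abs().mean()   # horizontal neighbours
    dy = (p[:, :, 1:, :] - p[:, :, :-1, :]).abs().mean()   # vertical neighbours
    return dx + dy

logits = torch.randn(2, 27, 40, 40, requires_grad=True)
loss = smoothness_loss(logits)
loss.backward()                  # differentiable, so it can serve as a prior term
print(float(loss))
```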
arXiv Detail & Related papers (2023-10-27T03:29:25Z)
- [CLS] Token is All You Need for Zero-Shot Semantic Segmentation [60.06653755695356]
We propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP.
Specifically, we use the [CLS] token output from the text branch, as an auxiliary semantic prompt, to replace the navigation [CLS] token in shallow layers of the ViT-based visual encoder.
Our proposed ZS3 method achieves SOTA performance, and it is even comparable with few-shot semantic segmentation methods.
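The token-replacement mechanism above can be sketched roughly as follows: the visual [CLS] token is overwritten with a projected text [CLS] embedding in the first few (shallow) layers of a toy ViT trunk. The number of shallow layers and the projection are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    """Toy ViT trunk where the visual [CLS] token is overwritten with a
    (projected) text [CLS] embedding in the first `shallow` layers only."""
    def __init__(self, dim=768, depth=12, shallow=3, text_dim=512):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
             for _ in range(depth)])
        self.proj = nn.Linear(text_dim, dim)   # map text width to visual width
        self.shallow = shallow

    def forward(self, tokens, text_cls):       # tokens: (B, 1+N, dim); text_cls: (B, text_dim)
        prompt = self.proj(text_cls)
        for i, blk in enumerate(self.blocks):
            if i < self.shallow:               # swap the [CLS] slot in shallow layers
                tokens = torch.cat([prompt.unsqueeze(1), tokens[:, 1:]], dim=1)
            tokens = blk(tokens)
        return tokens

vit = PromptedViT()
out = vit(torch.randn(2, 197, 768), torch.randn(2, 512))
print(out.shape)                               # torch.Size([2, 197, 768])
```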
arXiv Detail & Related papers (2023-04-13T01:35:07Z)
- Hierarchical Dense Correlation Distillation for Few-Shot Segmentation [46.696051965252934]
Few-shot semantic segmentation (FSS) aims to form class-agnostic models segmenting unseen classes with only a handful of annotations.
We design a Hierarchically Decoupled Matching Network (HDMNet) that mines pixel-level support correlations on top of a transformer architecture.
We propose a matching module to reduce train-set overfitting and introduce correlation distillation that leverages semantic correspondence from coarse resolution to boost fine-grained segmentation.
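The correlation-distillation idea, coarse-resolution semantic correspondence supervising finer levels, could be written generically as below; the KL-based loss and the assumption that the coarse correlation has already been resized to the fine token grid are illustrative choices, not HDMNet's actual formulation.

```python
import torch
import torch.nn.functional as F

def correlation(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """Pixel-level query-support correlation: cosine similarity between every
    query token and every support token. Inputs: (B, N, D) token features."""
    q = F.normalize(query, dim=-1)
    s = F.normalize(support, dim=-1)
    return q @ s.transpose(1, 2)                      # (B, Nq, Ns)

def correlation_distillation(fine_corr, coarse_corr, tau: float = 1.0):
    """Illustrative distillation: the coarse correlation (detached teacher)
    supervises the fine correlation; both are assumed to share one token grid."""
    teacher = F.softmax(coarse_corr.detach() / tau, dim=-1)
    student = F.log_softmax(fine_corr / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

q_fine, s_fine = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
q_coarse, s_coarse = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
loss = correlation_distillation(correlation(q_fine, s_fine),
                                correlation(q_coarse, s_coarse))
print(float(loss))
```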
arXiv Detail & Related papers (2023-03-26T08:13:12Z)
- Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision [49.905448429974804]
We consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories.
We propose a transformer-based model for OVS, termed as OVSegmentor, which exploits web-crawled image-text pairs for pre-training.
Our model achieves superior segmentation results over the state-of-the-art method by using only 3% of the data (4M vs. 134M) for pre-training.
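The summary does not spell out OVSegmentor's proxy objectives, so the block below only sketches the generic image-text contrastive alignment that web-crawled pairs make possible; treat it as illustrative background rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive(img_emb, txt_emb, temperature: float = 0.07):
    """Generic InfoNCE alignment between image and caption embeddings, the
    kind of supervision web-crawled image-text pairs provide.
    img_emb, txt_emb: (B, D), paired along the batch dimension."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(img.size(0))                # i-th image matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

print(float(image_text_contrastive(torch.randn(8, 256), torch.randn(8, 256))))
```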
arXiv Detail & Related papers (2023-01-22T13:10:05Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
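A minimal sketch of what "free dense labels" from a CLIP-like model can look like: score each patch feature against class-name text embeddings and take the per-patch argmax as a pseudo-label. DenseCLIP's specific architectural tweaks for extracting useful patch features are not reproduced here, and the threshold is a hypothetical addition.

```python
import torch
import torch.nn.functional as F

def dense_pseudo_labels(patch_feats, text_embeds, threshold: float = 0.0):
    """Turn CLIP-style features into dense labels: score every patch against
    every class-name embedding and take the per-patch argmax.
    patch_feats: (B, D, h, w); text_embeds: (K, D) from the text encoder."""
    B, D, h, w = patch_feats.shape
    patches = F.normalize(patch_feats.flatten(2).transpose(1, 2), dim=-1)  # (B, N, D)
    classes = F.normalize(text_embeds, dim=-1)                             # (K, D)
    scores = patches @ classes.T                                           # (B, N, K)
    labels = scores.argmax(-1)                                             # (B, N)
    labels[scores.max(-1).values < threshold] = -1   # mark low-confidence pixels unlabeled
    return labels.view(B, h, w)

print(dense_pseudo_labels(torch.randn(1, 512, 14, 14), torch.randn(21, 512)).shape)
```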
arXiv Detail & Related papers (2021-12-02T09:23:01Z)