Related papers: Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

URL: http://arxiv.org/abs/2409.19846v1
Date: Mon, 30 Sep 2024 01:13:03 GMT
Title: Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels
Authors: Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
Abstract summary: We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
Score: 53.8817160001038
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling in recognizing what objects are present. However, they struggle with pixel-level recognition tasks like semantic segmentation, which additionally require understanding where the objects are located. In this work, we propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding by guiding the model on where, which is achieved using unlabeled images and masks generated from vision foundation models such as SAM and DINO. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm using learnable class names to acquire general semantic concepts. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods in open-vocabulary semantic segmentation. Project page is available at https://cvlab-kaist.github.io/PixelCLIP
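A minimal sketch of the general idea described in the abstract, not the authors' implementation: per-pixel features are averaged inside each unlabeled mask, and the resulting region embedding is assigned to its most similar cluster embedding. The abstract does not specify the clustering details, so hard nearest-centroid assignment is used here as a stand-in, and the fixed `centroids` vectors are a hypothetical placeholder for embeddings of the learnable class names. All function names and values are illustrative assumptions.

```python
import math

def mask_pool(features, mask):
    """Average the per-pixel feature vectors selected by a binary mask."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for row_f, row_m in zip(features, mask):
        for vec, keep in zip(row_f, row_m):
            if keep:
                count += 1
                for i, v in enumerate(vec):
                    total[i] += v
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign_cluster(region, centroids):
    """Return the index of the centroid most similar to the region embedding."""
    sims = [cosine(region, c) for c in centroids]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy 2x2 feature map with 2-dimensional features (made-up values).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
top_mask = [[1, 1], [0, 0]]       # stands in for one unlabeled mask
bottom_mask = [[0, 0], [1, 1]]    # stands in for another unlabeled mask
# Hypothetical stand-ins for embeddings of two learnable class names.
centroids = [[1.0, 0.0], [0.0, 1.0]]

top_cluster = assign_cluster(mask_pool(features, top_mask), centroids)
bottom_cluster = assign_cluster(mask_pool(features, bottom_mask), centroids)
print(top_cluster, bottom_cluster)
```

In a trainable version, the cluster assignments would serve as pseudo-labels for the masks and the centroids would be updated alongside the image encoder; that training loop is omitted here.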

Related papers

FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation [63.31007867379312]
Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information. However, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information. We propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation.
arXiv Detail & Related papers (2025-01-01T15:47:04Z)
Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation [8.659766913542938]
We study a unified perceptual and semantic token compression for understanding at all granularities. We propose Feature Pyramid Tokenization (PAT) to cluster and represent multi-resolution features with learnable codebooks. Our experiments show that PAT enhances the semantic intuition of the VLM feature pyramid.
arXiv Detail & Related papers (2024-12-18T18:43:21Z)
LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation [16.864086165056698]
Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. We propose to alleviate the issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Our method achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2024-11-30T05:49:42Z)
Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [33.69721994194684]
We propose a novel framework for open-vocabulary semantic segmentation called EBSeg. The AdaB Decoder is designed to generate different image embeddings for both training and new classes. The SSC Loss aligns the inter-class affinity in the image feature space with that in the text feature space of CLIP.
arXiv Detail & Related papers (2024-06-14T08:34:20Z)
Subobject-level Image Tokenization [60.80949852899857]
Transformer-based vision models typically tokenize images into fixed-size square patches as input units. Inspired by the subword tokenization widely adopted in language models, we propose an image tokenizer at a subobject level.
arXiv Detail & Related papers (2024-02-22T06:47:44Z)
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment. We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities. With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
Learning Semantic Segmentation with Query Points Supervision on Aerial Images [57.09251327650334]
We present a weakly supervised learning algorithm to train semantic segmentation algorithms. Our proposed approach performs accurate semantic segmentation and improves efficiency by significantly reducing the cost and time required for manual annotation.
arXiv Detail & Related papers (2023-09-11T14:32:04Z)
Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning. It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos. Our algorithm sets a new state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [26.079055078561986]
We propose SegCLIP, a CLIP-based model for open-vocabulary segmentation. The main idea is to gather patches into semantic regions using learnable centers, through training on text-image pairs. Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z)
Maximize the Exploration of Congeneric Semantics for Weakly Supervised Semantic Segmentation [27.155133686127474]
We construct a graph neural network (P-GNN) based on the self-detected patches from different images that contain the same class labels. We conduct experiments on the popular PASCAL VOC 2012 benchmarks, and our model yields state-of-the-art performance.
arXiv Detail & Related papers (2021-10-08T08:59:16Z)
Exploring Cross-Image Pixel Contrast for Semantic Segmentation [130.22216825377618]
We propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing.
arXiv Detail & Related papers (2021-01-28T11:35:32Z)
Automatic Image Labelling at Pixel Level [21.59653873040243]
We propose an interesting learning approach to generate pixel-level image labellings automatically. A Guided Filter Network (GFN) is first developed to learn the segmentation knowledge from a source domain. GFN then transfers such segmentation knowledge to generate coarse object masks in the target domain.
arXiv Detail & Related papers (2020-07-15T00:34:11Z)
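
The core idea stated in the "Exploring Cross-Image Pixel Contrast for Semantic Segmentation" entry above, pulling same-class pixel embeddings together and pushing different-class embeddings apart, can be illustrated with a generic InfoNCE-style loss for a single anchor pixel. This is a sketch under assumptions: the entry does not give the loss formula, so the function below is a common contrastive form rather than that paper's exact objective, and the `temperature` value and unit-length toy embeddings are made up.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor pixel, averaged over its positives.

    positives: embeddings of pixels sharing the anchor's class.
    negatives: embeddings of pixels from other classes.
    Embeddings are assumed to be unit length.
    """
    neg_sum = sum(math.exp(dot(anchor, n) / temperature) for n in negatives)
    loss = 0.0
    for p in positives:
        pos = math.exp(dot(anchor, p) / temperature)
        loss += -math.log(pos / (pos + neg_sum))
    return loss / len(positives)

anchor = [1.0, 0.0]
# Well-separated case: the positive matches the anchor, the negative is orthogonal.
good_loss = info_nce(anchor, positives=[[1.0, 0.0]], negatives=[[0.0, 1.0]])
# Confused case: the positive is orthogonal, the negative matches the anchor.
bad_loss = info_nce(anchor, positives=[[0.0, 1.0]], negatives=[[1.0, 0.0]])
print(good_loss < bad_loss)
```

The loss is near zero when same-class embeddings align with the anchor and grows when a different-class embedding is closer than the positives.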

This list is automatically generated from the titles and abstracts of the papers on this site.