Related papers: CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

URL: http://arxiv.org/abs/2312.07661v3
Date: Tue, 7 May 2024 12:00:34 GMT
Title: CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
Authors: Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li,
Abstract summary: Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. We introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples.
Score: 18.288738950822342
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples, and sets the new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.
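The recurrent filtering idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `recurrent_filter`, `toy_segmenter`, the threshold, and the scoring rule are all hypothetical stand-ins. The only property carried over from the abstract is that a frozen scorer is applied repeatedly and low-scoring texts are dropped until the candidate set stops changing.

```python
from typing import Callable, Dict, List

# Hypothetical interface: maps (image, candidate texts) to one score per text.
# In the paper this role is played by a two-stage segmenter built on a frozen
# VLM; here it is only a placeholder.
Segmenter = Callable[[object, List[str]], Dict[str, float]]


def recurrent_filter(image: object, texts: List[str], segmenter: Segmenter,
                     threshold: float = 0.5, max_steps: int = 10) -> List[str]:
    """Repeatedly drop low-scoring texts until the candidate set is stable."""
    current = list(texts)
    for _ in range(max_steps):
        if not current:
            break
        scores = segmenter(image, current)
        kept = [t for t in current if scores[t] >= threshold]
        if kept == current:  # fixed point: nothing was filtered out
            break
        current = kept
    return current


# Toy scorer (assumption): each text has a fixed base score, reported relative
# to the mean over the current candidates, so removing distractors changes the
# remaining scores from one step to the next.
BASE = {"cat": 0.9, "sofa": 0.6, "dog": 0.4, "car": 0.1}


def toy_segmenter(image: object, texts: List[str]) -> Dict[str, float]:
    mean = sum(BASE[t] for t in texts) / len(texts)
    return {t: BASE[t] / mean for t in texts}


remaining = recurrent_filter(None, ["cat", "sofa", "dog", "car"],
                             toy_segmenter, threshold=0.9)
print(remaining)
```

With these toy scores the candidate set shrinks over several passes rather than in one, which is the behaviour the word "progressively" in the abstract refers to.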

Related papers

Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder [5.57393627015653]
Video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. We propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2.
arXiv Detail & Related papers (2025-06-28T13:30:36Z)
Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation [42.020470627552136]
Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks. Mask classification is the main performance bottleneck for open-vocab panoptic segmentation. We propose Semantic Refocused Tuning, a novel framework that greatly enhances open-vocab panoptic segmentation.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation [54.98688607911399]
We propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into Vision-Language Models (VLMs). Existing VLM adaptation methods improve performance on base (training) queries, but fail to preserve the open-set capabilities of VLMs on novel queries. Our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes.
arXiv Detail & Related papers (2024-05-30T15:16:06Z)
Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework. It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected. It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z)
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models [44.146292819267956]
Large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which proves effective for tasks like visual question answering. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS), for this task.
arXiv Detail & Related papers (2023-11-28T06:42:58Z)
Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision [49.905448429974804]
We consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories. We propose a transformer-based model for OVS, termed as OVSegmentor, which exploits web-crawled image-text pairs for pre-training. Our model achieves superior segmentation results over the state-of-the-art method by using only 3% data (4M vs 134M) for pre-training.
arXiv Detail & Related papers (2023-01-22T13:10:05Z)
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP [45.81698881151867]
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-
arXiv Detail & Related papers (2022-10-09T02:57:32Z)
A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP. Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations. We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images. Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
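Several entries above report results as mIoU (mean intersection over union). As a reminder of what the metric measures, here is a minimal sketch over flat lists of per-pixel class labels. The function name `mean_iou` is illustrative, and the choice to skip classes absent from both prediction and ground truth is an assumption; individual benchmarks differ in how they handle absent classes and ignore labels.

```python
from typing import List


def mean_iou(pred: List[int], target: List[int], num_classes: int) -> float:
    """Average per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: skip classes that never appear
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Papers usually quote this value multiplied by 100, as in the figures listed above.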

This list is automatically generated from the titles and abstracts of the papers on this site.