Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen
Convolutional CLIP
- URL: http://arxiv.org/abs/2308.02487v2
- Date: Tue, 14 Nov 2023 19:10:49 GMT
- Title: Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen
Convolutional CLIP
- Authors: Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
- Abstract summary: We propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone.
FC-CLIP sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets.
- Score: 28.103358632241104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary segmentation is a challenging task requiring segmenting and
recognizing objects from an open set of categories. One way to address this
challenge is to leverage multi-modal models, such as CLIP, to provide image and
text features in a shared embedding space, which bridges the gap between
closed-vocabulary and open-vocabulary recognition. Hence, existing methods
often adopt a two-stage framework to tackle the problem, where the inputs first
go through a mask generator and then through the CLIP model along with the
predicted masks. This process involves extracting features from images multiple
times, which can be ineffective and inefficient. By contrast, we propose to
build everything into a single-stage framework using a shared Frozen
Convolutional CLIP backbone, which not only significantly simplifies the
current two-stage pipeline, but also remarkably yields a better accuracy-cost
trade-off. The proposed FC-CLIP, benefits from the following observations: the
frozen CLIP backbone maintains the ability of open-vocabulary classification
and can also serve as a strong mask generator, and the convolutional CLIP
generalizes well to a larger input resolution than the one used during
contrastive image-text pretraining. When training on COCO panoptic data only
and testing in a zero-shot manner, FC-CLIP achieve 26.8 PQ, 16.8 AP, and 34.1
mIoU on ADE20K, 18.2 PQ, 27.9 mIoU on Mapillary Vistas, 44.0 PQ, 26.8 AP, 56.2
mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, +4.2 mIoU
on ADE20K, +4.0 PQ on Mapillary Vistas and +20.1 PQ on Cityscapes,
respectively. Additionally, the training and testing time of FC-CLIP is 7.5x
and 6.6x significantly faster than the same prior art, while using 5.9x fewer
parameters. FC-CLIP also sets a new state-of-the-art performance across various
open-vocabulary semantic segmentation datasets. Code at
https://github.com/bytedance/fc-clip
Related papers
- FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval [10.26297663751352]
Few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality with the target domain.
vision-language pretraining methods like CLIP have shown great few-shot or zero-shot learning performance.
To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP.
arXiv Detail & Related papers (2024-11-26T14:12:14Z) - Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation [79.66299178949257]
Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions.
vision-language foundation models, especially CLIP, have emerged as powerful tools for acquiring open-vocabulary capabilities.
H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while only requiring updating approximately 4% of the total parameters of CLIP.
arXiv Detail & Related papers (2024-05-29T07:41:34Z) - CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation [31.264574799748903]
We propose an open-vocabulary semantic segmentation method, which does not require any annotations.
We show that the used self-supervised feature properties can directly be learnt from CLIP features.
Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference.
arXiv Detail & Related papers (2023-12-19T17:40:27Z) - Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [120.97144647340588]
Mask-awareProposals CLIP (IP-CLIP) is proposed to handle arbitrary numbers of image and mask proposals simultaneously.
mask-aware loss and self-distillation loss are designed to fine-tune IP-CLIP, ensuring CLIP is responsive to different mask proposals.
We conduct extensive experiments on the popular zero-shot benchmarks.
arXiv Detail & Related papers (2023-09-30T03:27:31Z) - VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video
Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD)
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z) - You Only Segment Once: Towards Real-Time Panoptic Segmentation [68.91492389185744]
YOSO is a real-time panoptic segmentation framework.
YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps.
YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K.
arXiv Detail & Related papers (2023-03-26T07:55:35Z) - Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN)
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z) - ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [35.60888272729273]
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost.
We propose a simpler-and-efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level.
arXiv Detail & Related papers (2022-12-07T12:05:00Z) - Supervision Exists Everywhere: A Data Efficient Contrastive
Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP) to alleviate this limitation.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our De-CLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.