Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
- URL: http://arxiv.org/abs/2210.15138v1
- Date: Thu, 27 Oct 2022 02:57:26 GMT
- Title: Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
- Authors: Chaofan Ma, Yuhuan Yang, Yanfeng Wang, Ya Zhang, Weidi Xie
- Abstract summary: Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks.
We introduce Fusioner, a lightweight, transformer-based fusion module that pairs frozen visual representations with language concepts.
We show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data.
- Score: 39.479912987123214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When trained at a sufficient scale, self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks. In this paper, we investigate simple, yet effective approaches for adapting pre-trained foundation models to the downstream task of interest, namely, open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, a lightweight, transformer-based fusion module that pairs the frozen visual representation with language concepts using only a handful of image segmentation data; as a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment on a broad range of self-supervised models that have been pre-trained with different schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT), and a visual-language model (CLIP), and show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data; (iii) we conduct thorough ablation studies to analyze the critical components of the proposed Fusioner; evaluated on standard benchmarks, e.g. PASCAL-5i and COCO-20i, it surpasses existing state-of-the-art models by a large margin, despite only being trained on frozen visual and language features; (iv) to measure the model's robustness in learning visual-language correspondence, we further evaluate on a synthetic dataset, named Mosaic-4, where images are constructed by mosaicking samples from FSS-1000. Fusioner demonstrates superior performance over previous models.
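To make the fusion idea concrete, below is a minimal PyTorch sketch of the kind of lightweight, transformer-based fusion module the abstract describes: frozen visual patch features and frozen class-name embeddings are projected into a shared space, jointly attended over, and compared to produce per-class mask logits. The class name `CrossModalFusion`, the layer sizes, and the dot-product mask head are illustrative assumptions, not the authors' exact Fusioner architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal sketch of a lightweight transformer fusion module that pairs
    frozen visual patch features with frozen language (class-name) embeddings.
    Layer sizes and the dot-product mask head are illustrative assumptions,
    not the exact Fusioner architecture."""

    def __init__(self, vis_dim=768, txt_dim=512, d_model=256, num_layers=2, num_heads=8):
        super().__init__()
        # Project both frozen modalities into a shared fusion space.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, num_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(layer, num_layers)

    def forward(self, patch_feats, class_embeds):
        # patch_feats:  (B, N_patches, vis_dim)  from a frozen visual backbone
        # class_embeds: (B, N_classes, txt_dim)  from a frozen language model
        v = self.vis_proj(patch_feats)
        t = self.txt_proj(class_embeds)
        # Joint self-attention over the concatenated visual + language tokens.
        fused = self.fusion(torch.cat([v, t], dim=1))
        v_fused, t_fused = fused[:, : v.size(1)], fused[:, v.size(1):]
        # Per-class mask logits: similarity between fused patches and class tokens.
        return torch.einsum("bnd,bcd->bcn", v_fused, t_fused)  # (B, N_classes, N_patches)
```

In such a setup only the fusion module is trained on a handful of segmentation data while both backbones stay frozen; the (B, N_classes, N_patches) logits can be reshaped to the patch grid and upsampled to image resolution to form masks, and zero-shot transfer to novel categories reduces to swapping in new class-name embeddings.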
Related papers
- Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation [56.87049651707208]
Few-shot Semantic Segmentation has evolved into in-context tasks and has become a crucial element in assessing generalist segmentation models.
Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework.
Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework.
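As a rough illustration of the KV fusion idea mentioned above (not DiffewS's exact implementation), the sketch below shows a single-head self-attention step in which queries come only from the query image while keys and values concatenate query-image and support-image tokens; the function name and the plain weight matrices are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def kv_fusion_attention(q_feats, s_feats, w_q, w_k, w_v):
    """Illustrative single-head self-attention step with KV fusion: queries come
    from the query image only, while keys/values concatenate query-image and
    support-image tokens, so query tokens can attend to support content.
    The function name and the plain weight matrices are assumptions."""
    # q_feats: (N_q, d) query-image tokens; s_feats: (N_s, d) support-image tokens
    q = q_feats @ w_q                                    # (N_q, d)
    kv_in = torch.cat([q_feats, s_feats], dim=0)         # (N_q + N_s, d)
    k, v = kv_in @ w_k, kv_in @ w_v
    attn = F.softmax(q @ k.T / q.size(-1) ** 0.5, dim=-1)
    return attn @ v                                      # (N_q, d) fused query features
```

A multi-head version with per-head projections, applied inside the model's existing attention blocks, would be the natural extension; those details are paper-specific.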
arXiv Detail & Related papers (2024-10-03T10:33:49Z)
- Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over the Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
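A minimal sketch of the visual-token idea described above, under the assumption that it amounts to scoring each visual feature against the language model's vocabulary embeddings; the learned alignment matrix `proj` and the function name are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def visual_words(vis_feats, vocab_embeds, proj):
    """Sketch of mapping visual features to probability distributions over a
    Large Multi-modal Model's vocabulary. The learned alignment matrix `proj`
    and the function name are illustrative assumptions."""
    # vis_feats: (N_patches, d_vis); vocab_embeds: (V, d_txt); proj: (d_vis, d_txt)
    aligned = vis_feats @ proj              # bring visual features into the text space
    logits = aligned @ vocab_embeds.T       # similarity to every vocabulary embedding
    return F.softmax(logits, dim=-1)        # (N_patches, V): one distribution per patch
```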
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models.
We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models.
Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
arXiv Detail & Related papers (2024-01-02T17:08:26Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models? [14.582209994281374]
Few-shot learning aims to train models that can be generalized to novel classes with only a few samples.
We propose a novel few-shot learning framework that uses pre-trained language models based on contrastive learning.
arXiv Detail & Related papers (2023-07-09T08:07:43Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
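For context, the sketch below shows a minimal sparse (top-1) mixture-of-experts feed-forward layer of the kind such models substitute for dense FFN blocks; the routing is simplified (no capacity limits or load-balancing loss), and all names and sizes are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Minimal top-1 mixture-of-experts feed-forward layer, standing in for the
    dense FFN of a transformer block. Routing is simplified (no capacity limits
    or load-balancing loss); all sizes are illustrative."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, d_model); each token is dispatched to its top-1 expert only.
        gate = torch.softmax(self.router(x), dim=-1)   # (T, num_experts) routing weights
        score, idx = gate.max(dim=-1)                  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = score[mask, None] * expert(x[mask])
        return out
```

Because only the selected expert runs per token, parameter count grows with the number of experts while per-token compute stays close to a single dense FFN, which is the compute-versus-performance trade-off the summary refers to.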
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.