Waffling around for Performance: Visual Classification with Random Words
and Broad Concepts
- URL: http://arxiv.org/abs/2306.07282v2
- Date: Thu, 17 Aug 2023 02:27:32 GMT
- Title: Waffling around for Performance: Visual Classification with Random Words
and Broad Concepts
- Authors: Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia
Schmid, Zeynep Akata
- Abstract summary: WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
- Score: 121.60918966567657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The visual classification performance of vision-language models such as CLIP
has been shown to benefit from additional semantic knowledge from large
language models (LLMs) such as GPT-3. In particular, averaging over
LLM-generated class descriptors, e.g. "waffle, which has a round shape", can
notably improve generalization performance. In this work, we critically study
this behavior and propose WaffleCLIP, a framework for zero-shot visual
classification which simply replaces LLM-generated descriptors with random
character and word descriptors. Without querying external models, we achieve
comparable performance gains on a large number of visual classification tasks.
This allows WaffleCLIP to serve both as a low-cost alternative and as a
sanity check for any future LLM-based vision-language model extensions. We
conduct an extensive experimental study on the impact and shortcomings of the
additional semantics introduced with LLM-generated descriptors, and showcase
how, if available, semantic context is better leveraged by querying LLMs for
high-level concepts, which we show can also jointly resolve potential class
name ambiguities. Code is available here:
https://github.com/ExplainableML/WaffleCLIP.
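A minimal sketch of the core recipe in Python, using OpenAI's CLIP package
(this is not the authors' released implementation; the prompt template,
descriptor count, and ViT-B/32 backbone are illustrative assumptions):

```python
import random
import string

import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)


def random_descriptor(num_words=2, word_len=5):
    # A meaningless descriptor such as "qzkpt vmwrh", standing in for an
    # LLM-generated one such as "which has a round shape".
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=word_len))
        for _ in range(num_words)
    )


@torch.no_grad()
def waffle_classifier_weights(classnames, num_descriptors=8):
    # One weight vector per class: the average of the normalized text
    # embeddings of several prompts with random descriptors appended.
    weights = []
    for name in classnames:
        prompts = [
            f"a photo of a {name}, which has {random_descriptor()}."
            for _ in range(num_descriptors)
        ]
        emb = model.encode_text(clip.tokenize(prompts).to(device)).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        weights.append(emb.mean(dim=0))
    w = torch.stack(weights)
    return w / w.norm(dim=-1, keepdim=True)  # image_features @ w.T -> logits
```

Swapping random_descriptor() for genuine LLM output turns the same sketch
into the descriptor-averaging baseline, which is what makes it a convenient
sanity check: any gain that baseline shows over the random variant can be
attributed to semantics rather than to prompt ensembling.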
Related papers
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) still struggle to classify fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification [12.053713356249695]
We propose LLaMP (Large Language Models as Prompt learners), which produces adaptive prompts for the CLIP text encoder.
Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification.
arXiv Detail & Related papers (2023-12-07T06:43:34Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning.
Our approach beats state-of-the-art multimodal LLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
- LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions [13.546494268784757]
We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors.
Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
arXiv Detail & Related papers (2023-11-20T16:37:45Z)
- CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation [6.181169909576527]
Generalized Zero-shot Semantic Segmentation aims to segment both seen and unseen categories under the supervision of only the seen ones.
Existing methods adopt large-scale Vision-Language Models (VLMs), which obtain outstanding zero-shot performance.
We propose CLIP-ZSS (Zero-shot Semantic Segmentation), a training framework that enables any image encoder designed for closed-set segmentation to be applied to zero-shot and open-vocabulary tasks.
arXiv Detail & Related papers (2023-10-03T09:33:47Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- Visual Classification via Description from Large Language Models [23.932495654407425]
Vision-language models (VLMs) have shown promising performance on a variety of recognition tasks.
We present an alternative framework for classification with VLMs, which we call classification by description.
arXiv Detail & Related papers (2022-10-13T17:03:46Z)
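The last entry, classification by description, is the LLM-descriptor recipe
that WaffleCLIP revisits: an image is scored against several descriptor
prompts per class, and the similarities are averaged into a class score. A
hedged sketch of that scoring rule (the descriptor strings below are
hand-written stand-ins for LLM output, not taken from the paper):

```python
import torch
import clip  # same OpenAI CLIP package as in the sketch above

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hand-written stand-ins for LLM-generated class descriptors.
descriptors = {
    "hen": ["a red comb", "two legs", "brown feathers"],
    "waffle": ["a round shape", "a grid pattern", "a golden-brown color"],
}


@torch.no_grad()
def classify_by_description(image):  # image: a PIL.Image
    x = preprocess(image).unsqueeze(0).to(device)
    img = model.encode_image(x).float()
    img = img / img.norm(dim=-1, keepdim=True)
    scores = {}
    for cls, descs in descriptors.items():
        prompts = [f"{cls}, which has {d}" for d in descs]
        txt = model.encode_text(clip.tokenize(prompts).to(device)).float()
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # Each class is scored by its average image-descriptor similarity.
        scores[cls] = (img @ txt.T).mean().item()
    return max(scores, key=scores.get)
```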