Waffling around for Performance: Visual Classification with Random Words
and Broad Concepts
- URL: http://arxiv.org/abs/2306.07282v2
- Date: Thu, 17 Aug 2023 02:27:32 GMT
- Title: Waffling around for Performance: Visual Classification with Random Words
and Broad Concepts
- Authors: Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia
Schmid, Zeynep Akata
- Abstract summary: WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
- Score: 121.60918966567657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The visual classification performance of vision-language models such as CLIP
has been shown to benefit from additional semantic knowledge from large
language models (LLMs) such as GPT-3. In particular, averaging over
LLM-generated class descriptors, e.g. "waffle, which has a round shape", can
notably improve generalization performance. In this work, we critically study
this behavior and propose WaffleCLIP, a framework for zero-shot visual
classification which simply replaces LLM-generated descriptors with random
character and word descriptors. Without querying external models, we achieve
comparable performance gains on a large number of visual classification tasks.
This allows WaffleCLIP to serve both as a low-cost alternative and as a
sanity check for any future LLM-based vision-language model extensions. We
conduct an extensive experimental study on the impact and shortcomings of the
additional semantics introduced with LLM-generated descriptors, and showcase
how, if available, semantic context is better leveraged by querying LLMs for
high-level concepts, which we show can also jointly resolve potential class
name ambiguities. Code is available here:
https://github.com/ExplainableML/WaffleCLIP.
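A minimal sketch of the core recipe in Python, using OpenAI's CLIP package
(this is not the authors' released implementation; the prompt template,
descriptor count, and ViT-B/32 backbone are illustrative assumptions):

```python
import random
import string

import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)


def random_descriptor(num_words=2, word_len=5):
    # A meaningless descriptor such as "qzkpt vmwrh", standing in for an
    # LLM-generated one such as "which has a round shape".
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=word_len))
        for _ in range(num_words)
    )


@torch.no_grad()
def waffle_classifier_weights(classnames, num_descriptors=8):
    # One weight vector per class: the average of the normalized text
    # embeddings of several prompts with random descriptors appended.
    weights = []
    for name in classnames:
        prompts = [
            f"a photo of a {name}, which has {random_descriptor()}."
            for _ in range(num_descriptors)
        ]
        emb = model.encode_text(clip.tokenize(prompts).to(device)).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        weights.append(emb.mean(dim=0))
    w = torch.stack(weights)
    return w / w.norm(dim=-1, keepdim=True)  # image_features @ w.T -> logits
```

Swapping random_descriptor() for genuine LLM output turns the same sketch
into the descriptor-averaging baseline, which is what makes it a convenient
sanity check: any gain that baseline shows over the random variant can be
attributed to semantics rather than to prompt ensembling.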
Related papers
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) still struggle to classify fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification [12.053713356249695]
We propose LLaMP (Large Language Models as Prompt learners), which produces adaptive prompts for the CLIP text encoder.
Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification.
arXiv Detail & Related papers (2023-12-07T06:43:34Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning.
Our approach beats state-of-the-art multimodal LLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
- LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions [13.546494268784757]
We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors.
Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
arXiv Detail & Related papers (2023-11-20T16:37:45Z)
- CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation [6.181169909576527]
Generalized Zero-shot Semantic Segmentation aims to segment both seen and unseen categories under the supervision of only the seen ones.
Existing methods adopt large-scale Vision-Language Models (VLMs), which obtain outstanding zero-shot performance.
We propose CLIP-ZSS (Zero-shot Semantic Segmentation), a training framework that enables any image encoder designed for closed-set segmentation to be applied to zero-shot and open-vocabulary tasks.
arXiv Detail & Related papers (2023-10-03T09:33:47Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- Visual Classification via Description from Large Language Models [23.932495654407425]
Vision-language models (VLMs) have shown promising performance on a variety of recognition tasks.
We present an alternative framework for classification with VLMs, which we call classification by description.
arXiv Detail & Related papers (2022-10-13T17:03:46Z)
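The last entry, classification by description, is the LLM-descriptor recipe
that WaffleCLIP revisits: an image is scored against several descriptor
prompts per class, and the similarities are averaged into a class score. A
hedged sketch of that scoring rule (the descriptor strings below are
hand-written stand-ins for LLM output, not taken from the paper):

```python
import torch
import clip  # same OpenAI CLIP package as in the sketch above

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hand-written stand-ins for LLM-generated class descriptors.
descriptors = {
    "hen": ["a red comb", "two legs", "brown feathers"],
    "waffle": ["a round shape", "a grid pattern", "a golden-brown color"],
}


@torch.no_grad()
def classify_by_description(image):  # image: a PIL.Image
    x = preprocess(image).unsqueeze(0).to(device)
    img = model.encode_image(x).float()
    img = img / img.norm(dim=-1, keepdim=True)
    scores = {}
    for cls, descs in descriptors.items():
        prompts = [f"{cls}, which has {d}" for d in descs]
        txt = model.encode_text(clip.tokenize(prompts).to(device)).float()
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # Each class is scored by its average image-descriptor similarity.
        scores[cls] = (img @ txt.T).mean().item()
    return max(scores, key=scores.get)
```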