Visual Classification via Description from Large Language Models
- URL: http://arxiv.org/abs/2210.07183v1
- Date: Thu, 13 Oct 2022 17:03:46 GMT
- Title: Visual Classification via Description from Large Language Models
- Authors: Sachit Menon and Carl Vondrick
- Abstract summary: Vision-language models (VLMs) have shown promising performance on a variety of recognition tasks.
We present an alternative framework for classification with VLMs, which we call classification by description.
- Score: 23.932495654407425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) such as CLIP have shown promising performance
on a variety of recognition tasks using the standard zero-shot classification
procedure -- computing similarity between the query image and the embedded
words for each category. By only using the category name, they neglect to make
use of the rich context of additional information that language affords. The
procedure gives no intermediate understanding of why a category is chosen, and
furthermore provides no mechanism for adjusting the criteria used towards this
decision. We present an alternative framework for classification with VLMs,
which we call classification by description. We ask VLMs to check for
descriptive features rather than broad categories: to find a tiger, look for
its stripes; its claws; and more. By basing decisions on these descriptors, we
can provide additional cues that encourage using the features we want to be
used. In the process, we can get a clear idea of what features the model uses
to construct its decision; it gains some level of inherent explainability. We
query large language models (e.g., GPT-3) for these descriptors to obtain them
in a scalable way. Extensive experiments show our framework has numerous
advantages beyond interpretability. We show improvements in accuracy on ImageNet
across distribution shifts; demonstrate the ability to adapt VLMs to recognize
concepts unseen during training; and illustrate how descriptors can be edited
to effectively mitigate bias compared to the baseline.
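As a rough illustration of the procedure the abstract describes, the sketch below scores each class by the mean CLIP similarity between the query image and that class's descriptors. It assumes the descriptors have already been generated offline (e.g., by prompting GPT-3); the descriptor strings, file name, and model choice here are illustrative, not the paper's exact prompts or outputs.

```python
# Minimal sketch of descriptor-based zero-shot classification with CLIP.
# The descriptor dictionary below is illustrative; in the paper's setting the
# descriptors come from prompting a large language model such as GPT-3.
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

descriptors = {
    "tiger": [
        "a tiger, which has orange fur with black stripes",
        "a tiger, which has sharp claws",
        "a tiger, which has a long striped tail",
    ],
    "zebra": [
        "a zebra, which has black and white stripes",
        "a zebra, which has a horse-like body",
        "a zebra, which has a short upright mane",
    ],
}

image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    scores = {}
    for cls, descs in descriptors.items():
        tokens = clip.tokenize(descs).to(device)
        text_feat = model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feat.T).squeeze(0)  # similarity to each descriptor
        scores[cls] = sims.mean().item()              # class score = mean over descriptors

prediction = max(scores, key=scores.get)
print(prediction, scores)
```

Averaging over descriptors keeps the decision rule inspectable: the per-descriptor similarities indicate which features drove the prediction, which is the interpretability the abstract points to.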
Related papers
- Does VLM Classification Benefit from LLM Description Semantics? [26.743684911323857]
We propose a training-free method for selecting discriminative descriptions that work independently of classname-ensembling effects.
Our approach identifies descriptions that effectively differentiate classes within a local CLIP label neighborhood, improving classification accuracy across seven datasets.
arXiv Detail & Related papers (2024-12-16T16:01:18Z)
- Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$2$SRT) framework to explore the semantic correlation.
The proposed framework consists of two complementary modules, i.e., intra-category semantic refinement (ISR) module and inter-category semantic transfer (IST) module.
Experiments on OV-MLR benchmarks clearly demonstrate that the proposed C$2$SRT framework outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z)
- Enhancing Visual Classification using Comparative Descriptors [13.094102298155736]
We introduce a novel concept of comparative descriptors.
These descriptors emphasize the unique features of a target class against its most similar classes, enhancing differentiation.
An additional filtering process ensures that these descriptors are closer to the image embeddings in the CLIP space.
arXiv Detail & Related papers (2024-11-08T06:28:02Z)
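Reading the filtering step of the comparative-descriptor paper literally, one minimal way to keep only descriptors that sit close to image embeddings in CLIP space is a top-k similarity filter over a small pool of image embeddings; the helper below is a hypothetical sketch under that assumption, and its name, signature, and the value of `keep` are all illustrative.

```python
# Hypothetical descriptor filter: keep the k candidate descriptors whose CLIP
# text embeddings are most similar (on average) to a pool of image embeddings.
# One plausible reading of the filtering step, not the paper's exact procedure.
import torch

def filter_descriptors(candidates: list[str],
                       text_feats: torch.Tensor,   # (D, dim) L2-normalized CLIP text embeddings
                       image_feats: torch.Tensor,  # (N, dim) L2-normalized CLIP image embeddings
                       keep: int = 5) -> list[str]:
    sims = (text_feats @ image_feats.T).mean(dim=1)            # mean similarity per descriptor
    top = sims.topk(min(keep, len(candidates))).indices.tolist()
    return [candidates[i] for i in top]
```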
- LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions [13.546494268784757]
We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors.
Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
arXiv Detail & Related papers (2023-11-20T16:37:45Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features for classification.
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
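Going only by the summary above, the final selection stage of SLR-AVD can be approximated with an L1-penalized (sparse) logistic regression over descriptor-similarity features; the sketch below assumes those per-image similarities have already been computed with a VLM, and the array shapes, solver, and regularization strength are illustrative choices rather than the paper's settings.

```python
# Illustrative stand-in for the sparse-selection stage: fit an L1-penalized
# logistic regression on descriptor-similarity features so that only a subset
# of descriptors ends up with nonzero weight.
# X: (num_images, num_descriptors) similarities between each image embedding
#    and each LLM-generated descriptor embedding (computed elsewhere with a VLM).
# y: (num_images,) integer class labels for the few-shot training images.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_sparse_descriptor_classifier(X: np.ndarray, y: np.ndarray, C: float = 0.1):
    clf = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=5000)
    clf.fit(X, y)
    # Indices of descriptors that received nonzero weight for at least one class.
    selected = np.flatnonzero(np.any(clf.coef_ != 0, axis=0))
    return clf, selected
```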
- Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z)
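The replacement WaffleCLIP makes, as summarized above, can be sketched by pairing each class name with random character strings and random words instead of LLM output; the prompt template, word pool, and counts below are illustrative rather than the paper's exact recipe, and scoring would then proceed as in the descriptor-based sketch near the top of this page.

```python
# Illustrative construction of random "waffle" descriptors: each class name is
# paired with random character strings and random words in place of
# LLM-generated descriptors.
import random
import string

def random_descriptors(classname: str, num: int = 8, word_pool=None) -> list[str]:
    word_pool = word_pool or ["apple", "river", "static", "violet", "gear", "moss"]
    descs = []
    for _ in range(num):
        chars = "".join(random.choices(string.ascii_lowercase, k=5))
        word = random.choice(word_pool)
        descs.append(f"a photo of a {classname}, which has {chars}, {word}.")
    return descs

# Example: random_descriptors("tiger") -> 8 prompts with random fillers.
```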
- PatchMix Augmentation to Identify Causal Features in Few-shot Learning [55.64873998196191]
Few-shot learning aims to transfer knowledge learned from base categories with sufficient labelled data to novel categories with scarce known information.
We propose a novel data augmentation strategy dubbed as PatchMix that can break this spurious dependency.
We show that such an augmentation mechanism, different from existing ones, is able to identify the causal features.
arXiv Detail & Related papers (2022-11-29T08:41:29Z)
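The PatchMix summary above gives only the high-level idea, so the following is a generic CutMix-style patch-swapping augmentation offered as one plausible illustration; it is not the paper's exact PatchMix formulation, and the grid size, mixing ratio, and label handling are left as assumptions.

```python
# Generic patch-mixing augmentation sketch: split two images into a grid of
# patches and randomly swap a fraction of patches from a second image into the
# first. Illustrative only; the actual PatchMix method may differ in how
# patches are selected and how labels are handled.
import torch

def patch_mix(img_a: torch.Tensor, img_b: torch.Tensor,
              grid: int = 4, ratio: float = 0.25) -> torch.Tensor:
    # img_a, img_b: (C, H, W) tensors with H and W divisible by `grid`.
    c, h, w = img_a.shape
    ph, pw = h // grid, w // grid
    mixed = img_a.clone()
    num_swap = int(ratio * grid * grid)
    cells = torch.randperm(grid * grid)[:num_swap]
    for cell in cells.tolist():
        i, j = divmod(cell, grid)
        mixed[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw] = img_b[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw]
    return mixed
```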
- Fine-Grained Visual Classification with Efficient End-to-end Localization [49.9887676289364]
We present an efficient localization module that can be fused with a classification network in an end-to-end setup.
We evaluate the new model on the three benchmark datasets CUB200-2011, Stanford Cars and FGVC-Aircraft.
arXiv Detail & Related papers (2020-05-11T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.