Visual Classification via Description from Large Language Models
- URL: http://arxiv.org/abs/2210.07183v1
- Date: Thu, 13 Oct 2022 17:03:46 GMT
- Title: Visual Classification via Description from Large Language Models
- Authors: Sachit Menon and Carl Vondrick
- Abstract summary: Vision-language models (VLMs) have shown promising performance on a variety of recognition tasks.
We present an alternative framework for classification with VLMs, which we call classification by description.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) such as CLIP have shown promising performance
on a variety of recognition tasks using the standard zero-shot classification
procedure -- computing similarity between the query image and the embedded
words for each category. By only using the category name, they neglect to make
use of the rich context of additional information that language affords. The
procedure gives no intermediate understanding of why a category is chosen, and
furthermore provides no mechanism for adjusting the criteria used towards this
decision. We present an alternative framework for classification with VLMs,
which we call classification by description. We ask VLMs to check for
descriptive features rather than broad categories: to find a tiger, look for
its stripes; its claws; and more. By basing decisions on these descriptors, we
can provide additional cues that encourage using the features we want to be
used. In the process, we can get a clear idea of what features the model uses
to construct its decision; it gains some level of inherent explainability. We
query large language models (e.g., GPT-3) for these descriptors to obtain them
in a scalable way. Extensive experiments show our framework has numerous
advantages beyond interpretability. We show improvements in accuracy on ImageNet
across distribution shifts; demonstrate the ability to adapt VLMs to recognize
concepts unseen during training; and illustrate how descriptors can be edited
to effectively mitigate bias compared to the baseline.
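A minimal sketch of the descriptor-based scoring the abstract describes, using the open-source CLIP package; the descriptor lists and the image path are illustrative stand-ins (the paper obtains descriptors by prompting GPT-3), so treat this as an assumption-laden outline rather than the authors' implementation:

```python
# Minimal sketch of classification by description with CLIP. The descriptor
# lists and the image path are illustrative stand-ins; in the paper the
# descriptors come from prompting GPT-3 for each category.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

descriptors = {  # hypothetical descriptors, normally LLM-generated
    "tiger": ["a tiger, which has orange fur with black stripes",
              "a tiger, which has sharp claws",
              "a tiger, which has a long striped tail"],
    "zebra": ["a zebra, which has black and white stripes",
              "a zebra, which has a horse-like body",
              "a zebra, which has a short upright mane"],
}

image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    img = model.encode_image(image)
    img = img / img.norm(dim=-1, keepdim=True)
    scores = {}
    for cls, descs in descriptors.items():
        txt = model.encode_text(clip.tokenize(descs).to(device))
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # Class score = mean similarity over its descriptors; the individual
        # similarities double as the explanation for the decision.
        scores[cls] = (img @ txt.T).mean().item()
print(max(scores, key=scores.get), scores)
```

Because the per-descriptor similarities are retained, the same loop also yields the explanation: which descriptors carried the winning class.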
Related papers
- LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions (arXiv, 2023-11-20)
We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors.
Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
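Read from the summary alone, the refinement loop might look roughly like the following; both helpers are hypothetical stand-ins for the LLM agent and the VLM-based scoring, which this summary does not specify:

```python
# Hedged sketch of LLM-driven, training-free refinement of class descriptors.
# `llm_propose` and `vlm_fitness` are hypothetical placeholders for the paper's
# LLM agent and its VLM-based evaluation of a descriptor set.
def llm_propose(class_name, previous=None, feedback=None):
    # Hypothetical: prompt an LLM for (refined) descriptors of `class_name`.
    return previous or [f"a photo of a {class_name}"]

def vlm_fitness(descriptor_set):
    # Hypothetical: zero-shot accuracy on a small validation set using a VLM
    # (e.g., CLIP) with these per-class descriptors.
    return 0.0

def refine_descriptors(class_names, rounds=5, population=4):
    pool = [{c: llm_propose(c) for c in class_names} for _ in range(population)]
    for _ in range(rounds):
        elite = max(pool, key=vlm_fitness)
        elite_score = vlm_fitness(elite)
        # Elitism plus LLM-driven "mutations" of the best descriptor set.
        pool = [elite] + [
            {c: llm_propose(c, previous=d, feedback=elite_score)
             for c, d in elite.items()}
            for _ in range(population - 1)
        ]
    return max(pool, key=vlm_fitness)
```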
- Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification (arXiv, 2023-11-10)
Follow-up Differential Descriptions (FuDD) is a zero-shot approach that tailors the class descriptions to each dataset.
FuDD first identifies the ambiguous classes for each image, and then uses a Large Language Model (LLM) to generate new class descriptions that differentiate between them.
We show that FuDD consistently outperforms generic description ensembles and naive LLM-generated descriptions on 12 datasets.
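A rough sketch of that two-stage idea; the class list, image path, and the placeholder `differential_descriptions` (standing in for the LLM call) are assumptions, not FuDD's actual prompts or settings:

```python
# Hedged sketch of a FuDD-style two-stage pass: find the classes an image is
# ambiguous between, then re-score them with differential descriptions.
import torch
import clip
from PIL import Image

def differential_descriptions(classes):
    # Hypothetical placeholder: FuDD prompts an LLM for descriptions that
    # distinguish exactly these classes from one another.
    return {c: [f"a photo of a {c}, not a {o}" for o in classes if o != c]
            for c in classes}

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
class_names = ["tabby cat", "tiger cat", "lynx", "golden retriever"]  # illustrative

image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    img = model.encode_image(image)
    img = img / img.norm(dim=-1, keepdim=True)

    # Stage 1: generic prompts to find the ambiguous classes for this image.
    txt = model.encode_text(clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device))
    txt = txt / txt.norm(dim=-1, keepdim=True)
    ambiguous = [class_names[i] for i in (img @ txt.T).squeeze(0).topk(k=2).indices.tolist()]

    # Stage 2: re-score only those classes with the differential descriptions.
    scores = {}
    for c, descs in differential_descriptions(ambiguous).items():
        d = model.encode_text(clip.tokenize(descs).to(device))
        d = d / d.norm(dim=-1, keepdim=True)
        scores[c] = (img @ d.T).mean().item()
print(max(scores, key=scores.get))
```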
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment (arXiv, 2023-08-24)
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
- Text Descriptions are Compressive and Invariant Representations for Visual Learning (arXiv, 2023-07-10)
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
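Taken literally, that pipeline maps onto a few lines of CLIP plus scikit-learn; the description list is an illustrative stand-in for the LLM output, and the L1 penalty supplies the sparse feature selection mentioned above:

```python
# Hedged sketch of the described pipeline: LLM descriptions -> CLIP similarities
# as per-image features -> sparse (L1) logistic regression.
import torch
import clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

descriptions = [  # hypothetical LLM-generated visual descriptions, all classes pooled
    "a bird with a bright red crest", "a bird with a long hooked beak",
    "a dog with floppy ears", "a dog with a curled tail",
]
with torch.no_grad():
    d = model.encode_text(clip.tokenize(descriptions).to(device))
    d = d / d.norm(dim=-1, keepdim=True)

def descriptor_features(image_batch):
    """Represent each image by its similarity to every description."""
    with torch.no_grad():
        x = model.encode_image(image_batch.to(device))
        x = x / x.norm(dim=-1, keepdim=True)
        return (x @ d.T).float().cpu().numpy()

# With few-shot images X_train/X_test (preprocessed tensors) and labels y_train:
# clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
# clf.fit(descriptor_features(X_train), y_train)
# preds = clf.predict(descriptor_features(X_test))
```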
- Waffling around for Performance: Visual Classification with Random Words and Broad Concepts (arXiv, 2023-06-12)
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
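The replacement itself is simple enough to sketch; the prompt template, word pool, and counts below are illustrative guesses rather than the paper's settings:

```python
# Hedged sketch of WaffleCLIP-style prompts: random character and word
# "descriptors" appended to the class name instead of LLM-generated ones.
import random
import string

def waffle_prompts(class_name, n=4, length=8, seed=0):
    rng = random.Random(seed)
    word_pool = ["formal", "lemon", "granite", "orbit", "velvet", "copper"]  # illustrative
    prompts = []
    for _ in range(n):
        chars = "".join(rng.choices(string.ascii_lowercase, k=length))
        words = " ".join(rng.choices(word_pool, k=2))
        prompts.append(f"a photo of a {class_name}, which has {chars}.")
        prompts.append(f"a photo of a {class_name}, which has {words}.")
    return prompts

# The prompts are then embedded with CLIP's text encoder and averaged per class,
# just as one would do with LLM-generated descriptors.
print(waffle_prompts("tiger")[:2])
```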
- PatchMix Augmentation to Identify Causal Features in Few-shot Learning (arXiv, 2022-11-29)
Few-shot learning aims to transfer knowledge learned from base categories with sufficient labelled data to novel categories with scarce known information.
We propose a novel data augmentation strategy dubbed PatchMix that can break this spurious dependency.
We show that such an augmentation mechanism, different from existing ones, is able to identify the causal features.
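As a rough, CutMix-style illustration of patch-level mixing (the paper's exact PatchMix formulation, including how gallery patches and labels are handled, may differ), one might swap random grid patches between two images:

```python
# Hedged, CutMix-style sketch of patch-level mixing between two images; the
# actual PatchMix procedure may differ in how patches and labels are chosen.
import torch

def patchmix(img_a, img_b, grid=4, ratio=0.25, generator=None):
    """Replace a random subset of grid patches of img_a with patches of img_b.

    img_a, img_b: tensors of shape (C, H, W) with H, W divisible by `grid`.
    Returns the mixed image and the fraction of patches taken from img_b.
    """
    c, h, w = img_a.shape
    ph, pw = h // grid, w // grid
    mixed = img_a.clone()
    n_swap = max(1, int(ratio * grid * grid))
    idx = torch.randperm(grid * grid, generator=generator)[:n_swap]
    for i in idx.tolist():
        r, col = divmod(i, grid)
        mixed[:, r*ph:(r+1)*ph, col*pw:(col+1)*pw] = img_b[:, r*ph:(r+1)*ph, col*pw:(col+1)*pw]
    return mixed, n_swap / (grid * grid)

# Example: mix two random 3x224x224 images; the mixing weight can be used to
# interpolate labels, as in CutMix-style augmentation.
x, lam = patchmix(torch.randn(3, 224, 224), torch.randn(3, 224, 224))
```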
- Text2Model: Text-based Model Induction for Zero-shot Image Classification (arXiv, 2022-10-27)
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
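A toy version of such a hypernetwork in PyTorch; the embedding sizes and the simple MLP mapping description embeddings to classifier weights are assumptions, since the summary does not give the architecture:

```python
# Hedged sketch of a hypernetwork that turns per-class text embeddings into the
# weights of a linear classifier over image features. Sizes are illustrative.
import torch
import torch.nn as nn

class ClassifierHyperNet(nn.Module):
    def __init__(self, text_dim=512, feat_dim=512, hidden=1024):
        super().__init__()
        # Maps one class-description embedding to one row of classifier weights (+ bias).
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim + 1),
        )

    def forward(self, class_text_emb, image_feats):
        # class_text_emb: (num_classes, text_dim); image_feats: (batch, feat_dim)
        wb = self.net(class_text_emb)              # (num_classes, feat_dim + 1)
        w, b = wb[:, :-1], wb[:, -1]               # per-class weights and biases
        return image_feats @ w.T + b               # (batch, num_classes) logits

# Toy usage with random stand-ins for text embeddings and image features.
hyper = ClassifierHyperNet()
logits = hyper(torch.randn(5, 512), torch.randn(8, 512))  # 5 classes, 8 images
```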
- Region Comparison Network for Interpretable Few-shot Image Classification (arXiv, 2020-09-08)
Few-shot image classification aims to train models for new classes from only a limited number of labeled examples.
We propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works.
We also present a new way to generalize the interpretability from the level of tasks to categories.
- Fine-Grained Visual Classification with Efficient End-to-end Localization (arXiv, 2020-05-11)
We present an efficient localization module that can be fused with a classification network in an end-to-end setup.
We evaluate the new model on the three benchmark datasets CUB200-2011, Stanford Cars and FGVC-Aircraft.