Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification
- URL: http://arxiv.org/abs/2311.07593v2
- Date: Fri, 15 Mar 2024 08:58:05 GMT
- Title: Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification
- Authors: Reza Esfandiarpoor, Stephen H. Bach
- Abstract summary: Follow-up Differential Descriptions (FuDD) is a zero-shot approach that tailors the class descriptions to each dataset.
FuDD first identifies the ambiguous classes for each image, and then uses a Large Language Model (LLM) to generate new class descriptions that differentiate between them.
We show that FuDD consistently outperforms generic description ensembles and naive LLM-generated descriptions on 12 datasets.
- Score: 8.663915072332834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A promising approach for improving the performance of vision-language models like CLIP for image classification is to extend the class descriptions (i.e., prompts) with related attributes, e.g., using brown sparrow instead of sparrow. However, current zero-shot methods select a subset of attributes regardless of commonalities between the target classes, potentially providing no useful information that would have helped to distinguish between them. For instance, they may use color instead of bill shape to distinguish between sparrows and wrens, which are both brown. We propose Follow-up Differential Descriptions (FuDD), a zero-shot approach that tailors the class descriptions to each dataset and leads to additional attributes that better differentiate the target classes. FuDD first identifies the ambiguous classes for each image, and then uses a Large Language Model (LLM) to generate new class descriptions that differentiate between them. The new class descriptions resolve the initial ambiguity and help predict the correct label. In our experiments, FuDD consistently outperforms generic description ensembles and naive LLM-generated descriptions on 12 datasets. We show that differential descriptions are an effective tool to resolve class ambiguities, which otherwise significantly degrade the performance. We also show that high quality natural language class descriptions produced by FuDD result in comparable performance to few-shot adaptation methods.
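The two-round procedure described in the abstract can be illustrated in code. Below is a minimal sketch, assuming CLIP via the Hugging Face transformers library and a hypothetical llm_differential_fn helper standing in for the LLM call; taking the top-k classes as the ambiguous set is also an assumption here, not necessarily the paper's criterion.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify(image, class_names, descriptions):
    """Score an image against one text description per class; return class probabilities."""
    texts = [descriptions[c] for c in class_names]
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze(0)  # one logit per class
    return logits.softmax(dim=-1)

def fudd_predict(image, class_names, generic_descriptions, llm_differential_fn, k=5):
    # Round 1: score the image with generic descriptions, e.g. "a photo of a sparrow".
    probs = classify(image, class_names, generic_descriptions)
    k = min(k, len(class_names))
    ambiguous = [class_names[i] for i in probs.topk(k).indices.tolist()]
    # Round 2: the LLM writes descriptions that differentiate only the ambiguous classes
    # (e.g. bill shape rather than color for sparrows vs. wrens), and the image is re-scored.
    differential = llm_differential_fn(ambiguous)  # hypothetical: {class_name: description}
    probs2 = classify(image, ambiguous, differential)
    return ambiguous[int(probs2.argmax())]
```

The sketch keeps one description per class per round for simplicity; the essential structure is the two rounds: generic prompts first, then follow-up differential descriptions for only the ambiguous classes.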
Related papers
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings exhibit a greater degree of geometric structure in embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z) - Embracing Diversity: Interpretable Zero-shot classification beyond one vector per class [16.101460010750458]
We argue that to represent diversity within a class, zero-shot classification should move beyond a single vector.
We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining.
We find our method consistently outperforms standard zero-shot classification over a large suite of datasets.
arXiv Detail & Related papers (2024-04-25T16:29:06Z) - Few-shot Learner Parameterization by Diffusion Time-steps [133.98320335394004]
Few-shot learning is still challenging when using large multi-modal foundation models.
We propose the Time-step Few-shot (TiF) learner to make up for the lost attributes.
TiF learner significantly outperforms OpenCLIP and its adapters on a variety of fine-grained and customized few-shot learning tasks.
arXiv Detail & Related papers (2024-03-05T04:38:13Z) - LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions [13.546494268784757]
We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors.
Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
arXiv Detail & Related papers (2023-11-20T16:37:45Z) - Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions to a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image, as sketched below.
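That three-step recipe can be sketched as follows, under the assumption that image and description embeddings come from a VLM encoder (e.g. CLIP) and that the sparse selection step is L1-regularized logistic regression via scikit-learn; this is illustrative, not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def description_features(image_embeds, description_embeds):
    # image_embeds: (n_images, d); description_embeds: (n_descriptions, d).
    # Each feature is the cosine similarity between an image and one
    # LLM-generated visual description.
    img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    txt = description_embeds / np.linalg.norm(description_embeds, axis=1, keepdims=True)
    return img @ txt.T  # (n_images, n_descriptions)

def fit_sparse_classifier(features, labels, C=0.1):
    # The L1 penalty drives most description weights to zero, which effectively
    # selects a relevant subset of the visual descriptors for classification.
    clf = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=5000)
    return clf.fit(features, labels)
```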
arXiv Detail & Related papers (2023-07-10T03:06:45Z) - Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
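The hypernetwork idea in this summary can be sketched as follows, assuming class descriptions are already embedded by a text encoder; the MLP weight generator and the dimensions are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HyperClassifier(nn.Module):
    """Maps class-description embeddings to the weights of a linear multi-class classifier."""

    def __init__(self, text_dim=512, image_dim=512, hidden=256):
        super().__init__()
        # One description embedding in, one row of classifier weights out.
        self.weight_gen = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, image_dim)
        )

    def forward(self, image_features, class_text_features):
        # image_features: (batch, image_dim); class_text_features: (num_classes, text_dim)
        weights = self.weight_gen(class_text_features)  # (num_classes, image_dim)
        return image_features @ weights.T               # (batch, num_classes) logits
```

Because the classifier weights are produced from the descriptions at inference time, new classes can be handled zero-shot by simply supplying new descriptions.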
arXiv Detail & Related papers (2022-10-27T05:19:55Z) - Visual Classification via Description from Large Language Models [23.932495654407425]
Vision-language models (VLMs) have shown promising performance on a variety of recognition tasks.
We present an alternative framework for classification with VLMs, which we call classification by description.
arXiv Detail & Related papers (2022-10-13T17:03:46Z) - What does a platypus look like? Generating customized prompts for zero-shot image classification [52.92839995002636]
This work introduces a simple method to generate higher accuracy prompts without relying on any explicit knowledge of the task domain.
We leverage the knowledge contained in large language models (LLMs) to generate many descriptive sentences that contain important discriminating characteristics of the image categories.
This approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet.
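A minimal sketch of this prompt-ensembling idea, assuming CLIP via the transformers library and a hypothetical llm_generate_prompts helper in place of the LLM call:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def class_embedding(class_name, llm_generate_prompts):
    # llm_generate_prompts is a hypothetical helper returning LLM-written sentences,
    # e.g. "A platypus has a duck-like bill and a flat tail."
    prompts = llm_generate_prompts(class_name)
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)  # one averaged text embedding per class
```

Images are then classified by cosine similarity against these per-class embeddings, as in standard zero-shot CLIP but with richer, LLM-generated prompts.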
arXiv Detail & Related papers (2022-09-07T17:27:08Z) - Attribute Group Editing for Reliable Few-shot Image Generation [85.52840521454411]
We propose a new editing-based method, Attribute Group Editing (AGE), for few-shot image generation.
AGE examines the internal representation learned in GANs and identifies semantically meaningful directions.
arXiv Detail & Related papers (2022-03-16T06:54:09Z) - Attribute Propagation Network for Graph Zero-shot Learning [57.68486382473194]
We introduce the attribute propagation network (APNet), which is composed of 1) a graph propagation model generating an attribute vector for each class and 2) a parameterized nearest neighbor (NN) classifier.
APNet achieves either compelling performance or new state-of-the-art results in experiments with two zero-shot learning settings and five benchmark datasets.
arXiv Detail & Related papers (2020-09-24T16:53:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.