LLMs as Visual Explainers: Advancing Image Classification with Evolving
Visual Descriptions
- URL: http://arxiv.org/abs/2311.11904v2
- Date: Mon, 19 Feb 2024 09:24:44 GMT
- Title: LLMs as Visual Explainers: Advancing Image Classification with Evolving
Visual Descriptions
- Authors: Songhao Han, Le Zhuo, Yue Liao, Si Liu
- Abstract summary: We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors.
Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
- Score: 13.546494268784757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) offer a promising paradigm for image
classification by comparing the similarity between images and class embeddings.
A critical challenge lies in crafting precise textual representations for class
names. While previous studies have leveraged recent advancements in large
language models (LLMs) to enhance these descriptors, their outputs often suffer
from ambiguity and inaccuracy. We attribute this to two primary factors: 1) the
reliance on single-turn textual interactions with LLMs, leading to a mismatch
between generated text and visual concepts for VLMs; 2) the oversight of the
inter-class relationships, resulting in descriptors that fail to differentiate
similar classes effectively. In this paper, we propose a novel framework that
integrates LLMs and VLMs to find the optimal class descriptors. Our
training-free approach develops an LLM-based agent with an evolutionary
optimization strategy to iteratively refine class descriptors. We demonstrate
that our optimized descriptors are of high quality, effectively improving
classification accuracy on a wide range of benchmarks. Additionally, these
descriptors offer explainable and robust features, boosting performance across
various backbone models and complementing fine-tuning-based methods.
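As a concrete illustration of the recipe in the abstract, here is a minimal, runnable sketch of one evolutionary round-trip. It is a reading of the approach, not the authors' implementation: `embed` and `llm_refine` are stand-ins for a CLIP-style text encoder and the LLM agent, and all embeddings are random, so only the mutate-evaluate-select control flow is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    # Stand-in for a VLM text encoder such as CLIP; deterministic per string.
    r = np.random.default_rng(abs(hash(text)) % (2**32))
    v = r.standard_normal(64)
    return v / np.linalg.norm(v)

def llm_refine(descriptor: str, gen: int) -> str:
    # Stand-in for the LLM agent that proposes a refined descriptor.
    return f"{descriptor}, variant {gen}"

def accuracy(descriptors: list[str], img_embs: np.ndarray, labels: np.ndarray) -> float:
    # Zero-shot accuracy: assign each image to its most similar class descriptor.
    cls = np.stack([embed(d) for d in descriptors])
    return float(((img_embs @ cls.T).argmax(axis=1) == labels).mean())

classes = ["sparrow", "finch", "wren"]
best = [f"a photo of a {c}" for c in classes]          # initial descriptors
img_embs = rng.standard_normal((90, 64))               # fake image embeddings
img_embs /= np.linalg.norm(img_embs, axis=1, keepdims=True)
labels = rng.integers(0, len(classes), 90)

best_acc = accuracy(best, img_embs, labels)
for gen in range(5):                                   # evolutionary refinement loop
    candidate = [llm_refine(d, gen) for d in best]     # mutate: LLM rewrites descriptors
    acc = accuracy(candidate, img_embs, labels)        # evaluate: VLM similarity scoring
    if acc > best_acc:                                 # select: keep the fitter set
        best, best_acc = candidate, acc
print(best_acc, best)
```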
Related papers
- SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization [70.11167263638562]
Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images.
We first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework.
arXiv Detail & Related papers (2024-10-28T18:10:26Z)
- Towards Generative Class Prompt Learning for Fine-grained Visual Recognition [5.633314115420456]
Generative Class Prompt Learning and Contrastive Multi-class Prompt Learning (CoMPLe) are presented.
Generative Class Prompt Learning improves visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts.
CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation.
arXiv Detail & Related papers (2024-09-03T12:34:21Z)
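CoMPLe's contrastive component can be illustrated with a short sketch. This is not the paper's implementation: it simply optimizes learnable class-prompt embeddings with a cross-entropy over temperature-scaled cosine similarities, one common way to encourage inter-class separation; the few-shot image features here are random stand-ins.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

D, C, N = 64, 5, 40
img = F.normalize(torch.randn(N, D), dim=-1)       # frozen few-shot image features (fake)
y = torch.randint(0, C, (N,))                      # their class labels
prompts = torch.nn.Parameter(torch.randn(C, D))    # learnable class-prompt embeddings

opt = torch.optim.Adam([prompts], lr=0.1)
for _ in range(100):
    sims = img @ F.normalize(prompts, dim=-1).T / 0.07   # temperature-scaled cosine sims
    loss = F.cross_entropy(sims, y)    # pulls each image toward its own class prompt,
    opt.zero_grad()                    # pushes it away from the other classes
    loss.backward()
    opt.step()
print(float(loss))
```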
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
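The retrieve-then-rank idea behind RAR can be sketched as follows. This is a hypothetical outline, not the paper's code: `mllm_rank` is a placeholder for an actual MLLM call, and the label embeddings are random.

```python
import numpy as np

rng = np.random.default_rng(0)

def mllm_rank(candidates: list[str]) -> str:
    # Stand-in for prompting an MLLM with the image plus the shortlist;
    # a real system would ask the model to rank and return the best label.
    return candidates[0]

C, D, K = 1000, 64, 5
names = [f"class_{i}" for i in range(C)]
class_embs = rng.standard_normal((C, D))
class_embs /= np.linalg.norm(class_embs, axis=1, keepdims=True)

img = rng.standard_normal(D)
img /= np.linalg.norm(img)

topk = np.argsort(class_embs @ img)[-K:][::-1]   # retrieve: cheap similarity search
shortlist = [names[i] for i in topk]             # over the full label space
label = mllm_rank(shortlist)                     # rank: one expensive MLLM call
print(shortlist, "->", label)                    # over only K candidates
```

Splitting the work this way keeps the MLLM's comprehensive knowledge in play while only ever asking it to discriminate among a handful of retrieved candidates.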
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning.
Our approach beats state-of-the-art MLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings for each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
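A minimal sketch of SLR-AVD's selection stage, assuming random stand-ins for the LLM description embeddings and VLM image embeddings: scikit-learn's L1-penalized logistic regression performs the sparse feature selection described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fake stand-ins: in the paper the descriptions come from an LLM and the
# similarity features from a VLM such as CLIP; here they are random vectors.
n_desc, D, N = 30, 64, 200
desc_embs = rng.standard_normal((n_desc, D))   # embeddings of the LLM descriptions
img_embs = rng.standard_normal((N, D))         # few-shot image embeddings
labels = rng.integers(0, 4, N)

feats = img_embs @ desc_embs.T                 # one similarity feature per descriptor
clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
clf.fit(feats, labels)                         # the L1 penalty zeroes out weights,
kept = np.abs(clf.coef_).sum(axis=0) > 1e-6    # selecting a sparse descriptor subset
print(f"kept {kept.sum()} of {n_desc} descriptors")
```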
- Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z)
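In the spirit of WaffleCLIP's random descriptors, a hypothetical prompt generator might look like the following; the exact template and word pool in the paper may differ, and a CLIP-like encoder would embed and average these prompts per class.

```python
import random
import string

random.seed(0)

def waffle_prompts(class_name: str, n: int = 8) -> list[str]:
    # WaffleCLIP-style prompts: the descriptor slot is filled with random words
    # and random character strings instead of LLM-generated text.
    words = ["jelly", "binder", "hydrant", "meadow", "copper", "violin"]
    prompts = []
    for _ in range(n):
        chars = "".join(random.choices(string.ascii_lowercase, k=5))
        prompts.append(f"a photo of a {class_name}, which has {random.choice(words)} {chars}.")
    return prompts

print(waffle_prompts("sparrow", n=3))
```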
- Visual Classification via Description from Large Language Models [23.932495654407425]
Vision-language models (VLMs) have shown promising performance on a variety of recognition tasks.
We present an alternative framework for classification with VLMs, which we call classification by description.
arXiv Detail & Related papers (2022-10-13T17:03:46Z)
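A minimal sketch of classification by description: each class is scored by the mean similarity between the image embedding and the embeddings of that class's descriptors. The `embed` encoder and the descriptors below are illustrative stand-ins, not taken from the paper.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a shared CLIP-style encoder; deterministic per string.
    r = np.random.default_rng(abs(hash(text)) % (2**32))
    v = r.standard_normal(64)
    return v / np.linalg.norm(v)

# Descriptors of the kind an LLM might produce (illustrative only).
descriptors = {
    "tiger": ["orange fur with black stripes", "four legs", "a long tail"],
    "zebra": ["white fur with black stripes", "hooves", "a short mane"],
}

image = embed("stand-in for an image embedding")
scores = {cls: float(np.mean([image @ embed(d) for d in descs]))
          for cls, descs in descriptors.items()}   # class score = mean descriptor similarity
print(max(scores, key=scores.get), scores)
```

Scoring per descriptor rather than per class name is also what makes the prediction explainable: the individual similarity terms show which described attributes drove the decision.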
This list is automatically generated from the titles and abstracts of the papers on this site.