Self-Evolving Visual Concept Library using Vision-Language Critics
- URL: http://arxiv.org/abs/2504.00185v1
- Date: Mon, 31 Mar 2025 19:47:55 GMT
- Title: Self-Evolving Visual Concept Library using Vision-Language Critics
- Authors: Atharva Sehgal, Patrick Yuan, Ziniu Hu, Yisong Yue, Jennifer J. Sun, Swarat Chaudhuri
- Abstract summary: Building effective visual concept libraries is challenging, as manual definition is labor-intensive. Our approach, ESCHER, takes a library learning perspective to iteratively discover and improve visual concepts. We empirically demonstrate the ability of ESCHER to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks.
- Score: 38.15146001218907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them. Our approach, ESCHER, takes a library learning perspective to iteratively discover and improve visual concepts. ESCHER uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, ESCHER dynamically improves its concept generation strategy based on the VLM critic's feedback. Finally, ESCHER does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of ESCHER to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.
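The abstract describes a generate-critique-refine loop between an LLM concept generator and a VLM critic. The sketch below is one hedged reading of that loop, not ESCHER's actual interface: `propose_concepts` and `critic_scores` are hypothetical stand-ins for the LLM and VLM calls, and the ranking/pruning details are illustrative choices.

```python
"""Illustrative sketch of an ESCHER-style critic loop (not the authors' code)."""
from typing import Callable, Dict, List

def evolve_concept_library(
    classes: List[str],
    propose_concepts: Callable[[str, List[dict]], List[str]],    # stand-in for the LLM generator
    critic_scores: Callable[[str, List[str]], Dict[str, float]],  # stand-in for the VLM critic
    iterations: int = 5,
    keep_top_k: int = 10,
) -> Dict[str, List[str]]:
    library: Dict[str, List[str]] = {c: [] for c in classes}
    history: List[dict] = []  # feedback passed back to the generator as in-context examples

    for _ in range(iterations):
        for cls in classes:
            # 1) The LLM proposes candidate concepts, conditioned on past feedback.
            candidates = library[cls] + propose_concepts(cls, history)
            # 2) The VLM critic scores how discriminative each concept is downstream.
            scores = critic_scores(cls, candidates)
            # 3) Keep the strongest concepts; log everything as feedback for the next round.
            ranked = sorted(candidates, key=lambda c: scores.get(c, 0.0), reverse=True)
            library[cls] = ranked[:keep_top_k]
            history.append({"class": cls, "scores": scores})
    return library
```

Feeding the scored history back into the proposal step is where the in-context adaptation described in the abstract would enter; the abstract itself does not specify these data structures.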
Related papers
- VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning [19.116047583231452]
Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence.
Current LVLMs process entire images at the token level, which is inefficient compared to humans.
We propose VCM, an end-to-end self-supervised visual concept modeling framework.
arXiv Detail & Related papers (2025-04-28T09:39:07Z)
- Conceptual Codebook Learning for Vision-Language Models [27.68834532978939]
We propose Conceptual Codebook Learning (CoCoLe) to address the challenge of improving the generalization capability of vision-language models (VLMs).
We learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values.
We observe that this conceptual codebook learning method is able to achieve enhanced alignment between visual and linguistic modalities.
arXiv Detail & Related papers (2024-07-02T15:16:06Z)
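The CoCoLe summary above describes a codebook whose keys are visual concepts and whose values are conceptual prompts. Below is a minimal PyTorch sketch of such a key-value lookup; the class name, dimensions, and top-k weighting are assumptions for illustration, not CoCoLe's published implementation.

```python
# Rough sketch of a key/value conceptual codebook lookup (an interpretation of
# the summary above, not CoCoLe's released code). Keys play the role of visual
# concepts; values play the role of conceptual prompts retrieved for the
# image features that match them best.
import torch
import torch.nn.functional as F

class ConceptualCodebook(torch.nn.Module):
    def __init__(self, num_concepts: int, feat_dim: int, prompt_dim: int):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(num_concepts, feat_dim))      # visual concepts
        self.values = torch.nn.Parameter(torch.randn(num_concepts, prompt_dim))  # conceptual prompts

    def forward(self, image_feats: torch.Tensor, top_k: int = 3) -> torch.Tensor:
        # Cosine similarity between image features (batch, feat_dim) and concept keys.
        sim = F.normalize(image_feats, dim=-1) @ F.normalize(self.keys, dim=-1).T
        weights, idx = sim.topk(top_k, dim=-1)        # best-matching concepts per image
        prompts = self.values[idx]                    # (batch, top_k, prompt_dim)
        return (weights.softmax(-1).unsqueeze(-1) * prompts).sum(dim=1)  # weighted prompt
```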
- Pre-trained Vision-Language Models Learn Discoverable Visual Concepts [33.302556000017844]
We aim to answer whether visual concepts are learned "for free" by pre-trained VLMs, as such concepts would enable wide applications.
We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted by their vision-language interface with text-based concept prompts.
Our proposed concept discovery and learning framework is thus designed to identify a diverse list of generic visual concepts.
arXiv Detail & Related papers (2024-04-19T06:41:32Z)
- MyVLM: Personalizing VLMs for User-Specific Queries [78.33252556805931]
We take a first step toward the personalization of vision-language models, enabling them to learn and reason over user-provided concepts.
To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model.
Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM.
This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response.
arXiv Detail & Related papers (2024-03-21T17:51:01Z)
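The MyVLM summary above mentions external concept heads that act as toggles, plus a learned concept embedding in the VLM's feature space. The following is a hedged sketch of how such a head might look; the module name, the mean-pooling choice, and the threshold are assumptions for illustration only, not MyVLM's code.

```python
# Hedged sketch of a "concept head as toggle": a small classifier over frozen
# vision features decides whether a user-specific concept is present, and only
# then is a learned concept embedding exposed to the language model.
import torch

class ConceptHead(torch.nn.Module):
    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.recognizer = torch.nn.Linear(feat_dim, 1)                        # "is my concept here?"
        self.concept_embedding = torch.nn.Parameter(torch.randn(embed_dim))   # learned embedding for the concept

    def forward(self, vision_feats: torch.Tensor, threshold: float = 0.5):
        # vision_feats assumed to be (batch, tokens, feat_dim); pool over tokens.
        present = torch.sigmoid(self.recognizer(vision_feats.mean(dim=1))) > threshold
        # Return a per-image presence mask and the embedding, so the head
        # behaves like a per-concept toggle on the VLM.
        return present, self.concept_embedding
```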
- Advancing Ante-Hoc Explainable Models through Generative Adversarial Networks [24.45212348373868]
This paper presents a novel concept learning framework for enhancing model interpretability and performance in visual classification tasks.
Our approach appends an unsupervised explanation generator to the primary classifier network and makes use of adversarial training.
This work presents a significant step towards building inherently interpretable deep vision models with task-aligned concept representations.
arXiv Detail & Related papers (2024-01-09T16:16:16Z)
- Towards Concept-Aware Large Language Models [56.48016300758356]
Concepts play a pivotal role in various human cognitive functions, including learning, reasoning and communication.
There is very little work on endowing machines with the ability to form and reason with concepts.
In this work, we analyze how well contemporary large language models (LLMs) capture human concepts and their structure.
arXiv Detail & Related papers (2023-11-03T12:19:22Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
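The K-LITE summary describes enriching entities in natural language with WordNet and Wiktionary knowledge. The snippet below is only a minimal illustration of WordNet-based prompt enrichment using NLTK; K-LITE's actual pipeline is broader (it also draws on Wiktionary), and the prompt template here is an example, not the paper's.

```python
# Minimal illustration of knowledge-augmented prompts in the spirit of the
# summary above (not K-LITE's own pipeline): each class name is expanded with
# its WordNet gloss before being handed to a language-image model.
from nltk.corpus import wordnet as wn   # requires a one-time nltk.download("wordnet")

def enrich_prompt(class_name: str) -> str:
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if not synsets:
        return f"a photo of a {class_name}."
    gloss = synsets[0].definition()     # external knowledge about the entity
    return f"a photo of a {class_name}, which is {gloss}."

# e.g. enrich_prompt("mastiff") appends the WordNet gloss of the first "mastiff" synset.
```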
- Concept Learners for Few-Shot Learning [76.08585517480807]
We propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions.
We evaluate our model on few-shot tasks from diverse domains, including fine-grained image classification, document categorization and cell type annotation.
arXiv Detail & Related papers (2020-07-14T22:04:17Z)
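COMET's summary speaks of learning along human-interpretable concept dimensions. As a rough, non-authoritative sketch of that idea, per-concept prototypes and weighted concept-wise distances could look like the following; the function signature and mask-based slicing are assumptions based on the one-line summary, not the authors' implementation.

```python
# Rough sketch of classifying along concept dimensions (my reading of the
# COMET summary above, not the authors' code): each concept masks a slice of
# the feature vector, class prototypes are formed per concept from the support
# set, and per-concept distances are aggregated with learned concept weights.
import torch

def concept_logits(query, support, support_labels, concept_masks, concept_weights, num_classes):
    # query: (Q, D); support: (S, D); support_labels: (S,) long tensor;
    # concept_masks: (C, D) binary tensor; concept_weights: (C,) tensor.
    logits = torch.zeros(query.shape[0], num_classes)
    for c, mask in enumerate(concept_masks):
        q_c, s_c = query * mask, support * mask           # restrict features to one concept dimension
        for k in range(num_classes):
            proto = s_c[support_labels == k].mean(dim=0)  # concept-specific class prototype
            dist = (q_c - proto).pow(2).sum(dim=-1)
            logits[:, k] -= concept_weights[c] * dist     # closer along this concept -> higher score
    return logits
```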