Tree of Attributes Prompt Learning for Vision-Language Models
- URL: http://arxiv.org/abs/2410.11201v1
- Date: Tue, 15 Oct 2024 02:37:39 GMT
- Title: Tree of Attributes Prompt Learning for Vision-Language Models
- Authors: Tong Ding, Wanhua Li, Zhongqi Miao, Hanspeter Pfister
- Abstract summary: We propose the Tree of Attributes Prompt learning (TAP), which generates a tree of attributes with a "concept - attribute - description" structure for each category.
Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs.
Our approach introduces text and vision prompts designed to explicitly learn the corresponding visual attributes, effectively serving as domain experts.
- Score: 27.64685205305313
- Abstract: Prompt learning has proven effective in adapting vision-language models for downstream tasks. However, existing methods usually append learnable prompt tokens solely to the category names to obtain textual features, which fails to fully leverage the rich context indicated in the category names. To address this issue, we propose the Tree of Attributes Prompt learning (TAP), which first instructs LLMs to generate a tree of attributes with a "concept - attribute - description" structure for each category, and then learns the hierarchy with vision and text prompt tokens. Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs associated with class names from LLMs. Furthermore, our approach introduces text and vision prompts designed to explicitly learn the corresponding visual attributes, effectively serving as domain experts. Additionally, the general and diverse descriptions generated based on the class names may be wrong or absent in the specific given images. To address this misalignment, we further introduce a vision-conditional pooling module to extract instance-specific text features. Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods on zero-shot base-to-novel generalization, cross-dataset transfer, and few-shot classification across 11 diverse datasets.
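The vision-conditional pooling module described in the abstract selects, for each image, the generated descriptions that actually apply to it. Below is a minimal sketch of that idea as attention pooling over precomputed CLIP-style embeddings; the module name, projection layers, and tensor shapes are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionConditionalPooling(nn.Module):
    """Attention-pool a set of description embeddings into one instance-specific
    text feature, weighted by relevance to the image (illustrative sketch only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # image feature -> attention query
        self.key_proj = nn.Linear(dim, dim)    # description embeddings -> attention keys

    def forward(self, image_feat: torch.Tensor, desc_feats: torch.Tensor) -> torch.Tensor:
        # image_feat: [B, D] image embeddings; desc_feats: [N, D] description embeddings
        q = self.query_proj(image_feat)                             # [B, D]
        k = self.key_proj(desc_feats)                               # [N, D]
        attn = F.softmax(q @ k.t() / (q.shape[-1] ** 0.5), dim=-1)  # [B, N] per-description weights
        return attn @ desc_feats                                    # [B, D] pooled text feature

# Toy usage with random vectors standing in for CLIP embeddings.
pool = VisionConditionalPooling(dim=512)
image_feat = F.normalize(torch.randn(4, 512), dim=-1)  # 4 images
desc_feats = F.normalize(torch.randn(8, 512), dim=-1)  # 8 LLM-generated descriptions of one attribute
print(pool(image_feat, desc_feats).shape)              # torch.Size([4, 512])
```

In the tree-of-attributes setup, a pool like this would presumably be applied per attribute branch before the resulting text features are compared against the image feature for classification.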
Related papers
- Descriminative-Generative Custom Tokens for Vision-Language Models [101.40245125955306]
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs).
Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries.
arXiv Detail & Related papers (2025-02-17T18:13:42Z)
- From Open-Vocabulary to Vocabulary-Free Semantic Segmentation [78.62232202171919]
Open-vocabulary semantic segmentation enables models to identify novel object categories beyond their training data.
Current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real-world applications.
This work proposes a Vocabulary-Free Semantic pipeline, eliminating the need for predefined class vocabularies.
arXiv Detail & Related papers (2025-02-17T15:17:08Z)
- Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition [1.2499537119440243]
We tackle zero-shot "real" classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) to classify objects based solely on descriptive attributes, excluding object class names.
We release description data for six popular fine-grained benchmarks, which omit object names to encourage genuine zero-shot learning.
We introduce a modified CLIP architecture that leverages multiple resolutions to improve the detection of fine-grained part attributes.
arXiv Detail & Related papers (2024-12-18T15:28:08Z)
- ATPrompt: Textual Prompt Learning with Embedded Attributes [73.1352833091256]
We introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt.
We transform the text prompt from a category-centric form to an attribute-category hybrid form.
As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format.
arXiv Detail & Related papers (2024-12-12T16:57:20Z)
- Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning [23.671999163027284]
This paper proposes a novel framework for multi-label image recognition without any training data.
It uses knowledge from a pre-trained Large Language Model (LLM) to learn prompts that adapt a pre-trained Vision-Language Model (VLM) such as CLIP to multi-label classification.
Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition.
arXiv Detail & Related papers (2024-03-02T13:43:32Z)
- Multi-modal Attribute Prompting for Vision-Language Models [40.39559705414497]
Pre-trained Vision-Language Models (VLMs) exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios.
Existing prompting techniques primarily focus on global text and image representations, while overlooking multi-modal attribute characteristics.
We propose a Multi-modal Attribute Prompting method (MAP) by jointly exploring textual attribute prompting, visual attribute prompting, and attribute-level alignment.
arXiv Detail & Related papers (2024-03-01T01:28:10Z)
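The MAP entry above aligns visual and textual features at the attribute level rather than only globally. A rough sketch of one such attribute-level alignment score over precomputed per-attribute features follows; the shapes and the simple averaging scheme are illustrative guesses, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attribute_level_alignment(visual_attrs: torch.Tensor,
                              text_attrs: torch.Tensor) -> torch.Tensor:
    """Cosine similarity per attribute, averaged into an image-class score.
    visual_attrs: [B, A, D] visual attribute tokens per image.
    text_attrs:   [C, A, D] textual attribute features per class.
    Returns [B, C] alignment logits (illustrative; not MAP's exact formulation)."""
    v = F.normalize(visual_attrs, dim=-1)
    t = F.normalize(text_attrs, dim=-1)
    sims = torch.einsum("bad,cad->bca", v, t)  # per-attribute similarities, [B, C, A]
    return sims.mean(dim=-1)                   # average over the A attributes

scores = attribute_level_alignment(torch.randn(2, 5, 512), torch.randn(10, 5, 512))
print(scores.shape)  # torch.Size([2, 10])
```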
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
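The Videoprompter entry above builds on the standard VLM zero-shot recipe: pool a video's visual features and compare them against text embeddings of the class labels (which the paper enriches with generative video-to-text and text-to-text models). A bare-bones version of that scoring step, with random tensors standing in for real embeddings:

```python
import torch
import torch.nn.functional as F

def zero_shot_video_scores(frame_feats: torch.Tensor,
                           class_text_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats:      [T, D] per-frame visual embeddings of one video.
    class_text_feats: [C, D] text embeddings of the class-label representations.
    Returns a [C] vector of similarity scores (sketch of the generic scoring step,
    not the Videoprompter ensemble itself)."""
    video_feat = F.normalize(frame_feats.mean(dim=0), dim=-1)  # temporal average pooling
    text = F.normalize(class_text_feats, dim=-1)
    return text @ video_feat                                   # cosine similarity per class

scores = zero_shot_video_scores(torch.randn(16, 512), torch.randn(400, 512))
print(scores.argmax().item())  # index of the predicted class
```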
- AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation [33.25304533086283]
Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time.
Recent studies have explored vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios.
This work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by human cognition in understanding new concepts.
arXiv Detail & Related papers (2023-08-31T19:34:09Z)
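The AttrSeg entry above decomposes each (possibly novel) class name into attributes and aggregates attribute-level cues back into a class-level prediction. A simplified sketch of that decompose-then-aggregate scoring over dense features is given below; the mean aggregation stands in for the paper's learned aggregation module, and all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def decompose_aggregate_masks(pixel_feats: torch.Tensor,
                              attr_feats_per_class: list) -> torch.Tensor:
    """pixel_feats:          [H, W, D] dense visual features from a segmentation backbone.
    attr_feats_per_class: list of [A_c, D] attribute text embeddings, one tensor per class.
    Returns [C, H, W] class score maps (illustrative; AttrSeg's aggregation is learned)."""
    p = F.normalize(pixel_feats, dim=-1)
    maps = []
    for attrs in attr_feats_per_class:
        a = F.normalize(attrs, dim=-1)           # [A_c, D]
        sim = torch.einsum("hwd,ad->ahw", p, a)  # one similarity map per attribute
        maps.append(sim.mean(dim=0))             # aggregate attributes into a class map
    return torch.stack(maps)                     # [C, H, W]

masks = decompose_aggregate_masks(torch.randn(32, 32, 512),
                                  [torch.randn(3, 512), torch.randn(5, 512)])
print(masks.shape)  # torch.Size([2, 32, 32])
```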
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
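The three-step pipeline in the entry above (LLM-generated descriptions, VLM similarity features, sparse logistic regression) can be sketched in a few lines with scikit-learn; here a random matrix stands in for the image-to-description similarity features a VLM such as CLIP would produce, and the L1 penalty plays the role of the sparse feature-selection step. This is a rough approximation, not the authors' exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X[i, j] = similarity of image i to LLM-generated description j (e.g. CLIP cosine
# scores); random numbers stand in for real VLM features. y holds the image labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))    # 200 images x 300 class descriptions
y = rng.integers(0, 5, size=200)   # 5 classes

# An L1 penalty keeps only a sparse subset of descriptions per class, mirroring
# the "select a relevant subset of these features" step described above.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = (np.abs(clf.coef_) > 1e-6).sum(axis=1)
print("descriptions kept per class:", kept)
```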
- Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z)
- What's in a Name? Beyond Class Indices for Image Recognition [28.02490526407716]
We propose assigning class names to images with a vision-language model, given only a large (essentially unconstrained) vocabulary of categories as prior information.
We leverage non-parametric methods to establish meaningful relationships between images, allowing the model to automatically narrow down the pool of candidate names.
Our method leads to a roughly 50% improvement over the baseline on ImageNet in the unsupervised setting.
arXiv Detail & Related papers (2023-04-05T11:01:23Z)
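The last entry above narrows an essentially unconstrained name vocabulary by exploiting non-parametric relationships between images. One simple reading of that idea is a nearest-neighbour vote over tentative name assignments, sketched here with random embeddings; the voting scheme and all names are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def narrow_candidate_names(query_feat, gallery_feats, gallery_name_ids, k=10, keep=3):
    """Rank candidate name ids by how often they occur among the query image's
    k nearest gallery images (a non-parametric narrowing step; illustrative only).
    query_feat:       [D]    embedding of the unlabeled image
    gallery_feats:    [N, D] embeddings of other images
    gallery_name_ids: [N]    tentative name id assigned to each gallery image"""
    sims = F.normalize(gallery_feats, dim=-1) @ F.normalize(query_feat, dim=0)
    nearest = sims.topk(k).indices                      # k most similar images
    votes = torch.bincount(gallery_name_ids[nearest])   # count names among the neighbours
    return votes.topk(min(keep, len(votes))).indices    # shortlist of candidate name ids

shortlist = narrow_candidate_names(torch.randn(512),
                                   torch.randn(1000, 512),
                                   torch.randint(0, 200, (1000,)))
print(shortlist)
```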