TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning
- URL: http://arxiv.org/abs/2412.08176v1
- Date: Wed, 11 Dec 2024 08:07:12 GMT
- Title: TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning
- Authors: Jingjing Xie, Yuxin Zhang, Jun Peng, Zhaohong Huang, Liujuan Cao,
- Abstract summary: TextRefiner is a plug-and-play method to refine the text prompts of existing methods.
It builds a novel local cache module to encapsulate fine-grained visual concepts.
It achieves state-of-the-art performance and is efficient in inference.
- Score: 16.881957688535557
- License:
- Abstract: Despite the efficiency of prompt learning in transferring vision-language models (VLMs) to downstream tasks, existing methods mainly learn the prompts in a coarse-grained manner where the learned prompt vectors are shared across all categories. Consequently, the tailored prompts often fail to discern class-specific visual concepts, thereby hindering the transferred performance for classes that share similar or complex visual attributes. Recent advances mitigate this challenge by leveraging external knowledge from Large Language Models (LLMs) to furnish class descriptions, yet incurring notable inference costs. In this paper, we introduce TextRefiner, a plug-and-play method to refine the text prompts of existing methods by leveraging the internal knowledge of VLMs. Particularly, TextRefiner builds a novel local cache module to encapsulate fine-grained visual concepts derivedfrom local tokens within the image branch. By aggregating and aligning the cached visual descriptions with the original output of the text branch, TextRefiner can efficiently refine and enrich the learned prompts from existing methods without relying on any external expertise. For example, it improves the performance of CoOp from 71.66 % to 76.94 % on 11 benchmarks, surpassing CoCoOp which introduces instance-wise features for text prompts. Equipped with TextRefiner, PromptKD achieves state-of-the-art performance and is efficient in inference. Our code is relesed at https://github.com/xjjxmu/TextRefiner
Related papers
- Tree of Attributes Prompt Learning for Vision-Language Models [27.64685205305313]
We propose the Tree of Attributes Prompt learning (TAP), which generates a tree of attributes with a "concept - attribute - description" structure for each category.
Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs.
Our approach introduces text and vision prompts designed to explicitly learn the corresponding visual attributes, effectively serving as domain experts.
arXiv Detail & Related papers (2024-10-15T02:37:39Z) - The Solution for Language-Enhanced Image New Category Discovery [5.500122875523184]
We propose reversing the training process of CLIP and introducing the concept of Pseudo Visual Prompts.
These prompts are for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models.
We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity.
arXiv Detail & Related papers (2024-07-06T08:09:29Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z) - Can Better Text Semantics in Prompt Tuning Improve VLM Generalization? [28.041879000565874]
We introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models.
Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts.
Our comprehensive experiments conducted across 11 benchmark datasets show that our method outperforms established methods.
arXiv Detail & Related papers (2024-05-13T16:52:17Z) - VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z) - Learning to Prompt with Text Only Supervision for Vision-Language Models [107.282881515667]
One branch of methods adapts CLIP by learning prompts using visual information.
An alternative approach resorts to training-free methods by generating class descriptions from large language models.
We propose to combine the strengths of both streams by learning prompts using only text data.
arXiv Detail & Related papers (2024-01-04T18:59:49Z) - TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model [78.77544632773404]
We present a Textual-based Class-aware Prompt tuning( TCP) that explicitly incorporates prior knowledge about classes to enhance their discriminability.
TCP consistently achieves superior performance while demanding less training time.
arXiv Detail & Related papers (2023-11-30T03:59:23Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.