Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models
- URL: http://arxiv.org/abs/2404.10357v2
- Date: Wed, 17 Apr 2024 02:48:49 GMT
- Title: Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models
- Authors: Enming Zhang, Bingke Zhu, Yingying Chen, Qinghai Miao, Ming Tang, Jinqiao Wang
- Abstract summary: CoKnow is a framework that enhances Prompt Learning for Vision-Language Models with rich contextual knowledge.
We conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods.
- Score: 26.964848679914354
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language Models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage VLMs' potential in adapting to downstream tasks, context optimization methods like Prompt Tuning are essential. However, one key limitation is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can result in incorrect predictions in downstream tasks. To address this challenge, we propose Context Optimization with Multi-Knowledge Representation (CoKnow), a framework that enhances Prompt Learning for VLMs with rich contextual knowledge. To facilitate CoKnow during inference, we trained lightweight semantic knowledge mappers, which can generate Multi-Knowledge Representation for an input image without requiring additional priors. We conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods. We will make all resources open-source: https://github.com/EMZucas/CoKnow.
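The following is a minimal, illustrative PyTorch sketch of the idea described in the abstract: a CoOp-style learnable prompt context combined with several knowledge-specific branches, plus a lightweight mapper from image features to each knowledge representation so that no extra priors are needed at inference. Module names, dimensions, the number of knowledge views, and the fusion scheme are assumptions made for illustration, not the authors' implementation.

```python
# Sketch of CoKnow-style prompt learning with multi-knowledge representations.
# Everything below (branch count, MLP mapper, averaging of logits) is an assumption;
# the frozen CLIP text/image encoders are replaced by plain tensors to stay self-contained.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeMapper(nn.Module):
    """Lightweight mapper from image features to one knowledge representation."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(image_feat), dim=-1)


class MultiKnowledgePrompt(nn.Module):
    """Learnable context vectors plus per-knowledge class embeddings (hypothetical layout)."""
    def __init__(self, n_classes: int, n_knowledge: int = 3, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        # One learnable context per knowledge view (CoOp-style soft prompt).
        self.ctx = nn.Parameter(torch.randn(n_knowledge, n_ctx, dim) * 0.02)
        # Stand-in for frozen text-encoder outputs of class names under each knowledge view.
        self.class_embed = nn.Parameter(torch.randn(n_knowledge, n_classes, dim) * 0.02)
        self.mappers = nn.ModuleList(KnowledgeMapper(dim) for _ in range(n_knowledge))

    def forward(self, image_feat: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
        image_feat = F.normalize(image_feat, dim=-1)
        logits = []
        for k, mapper in enumerate(self.mappers):
            # Combine the learnable context summary with class embeddings for view k.
            text_feat = F.normalize(self.class_embed[k] + self.ctx[k].mean(0), dim=-1)
            # Knowledge-guided image representation for view k.
            img_k = F.normalize(image_feat + mapper(image_feat), dim=-1)
            logits.append(img_k @ text_feat.t() / temperature)
        # Average the per-knowledge logits for the final prediction.
        return torch.stack(logits).mean(0)


# Toy usage: 8 images with 512-d features, 11 classes.
model = MultiKnowledgePrompt(n_classes=11)
feats = torch.randn(8, 512)
labels = torch.randint(0, 11, (8,))
loss = F.cross_entropy(model(feats), labels)
loss.backward()
```

In the usual prompt-tuning setup only the context vectors and the small mappers would be optimized while the VLM encoders stay frozen; how CoKnow combines the per-knowledge branches in practice is described in the paper itself.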
Related papers
- Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering [44.008094698200026]
We introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources.
Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge.
This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed.
arXiv Detail & Related papers (2024-11-25T19:01:03Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
arXiv Detail & Related papers (2024-08-07T02:28:37Z) - Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.
Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.
We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z) - LiPost: Improved Content Understanding With Effective Use of Multi-task Contrastive Learning [2.611731148829789]
We fine-tune a pre-trained, transformer-based LLM using multi-task contrastive learning with data from a diverse set of semantic labeling tasks.
Our model outperforms the baseline on zero-shot learning and offers improved multilingual support.
This work provides a robust foundation for vertical teams across LinkedIn to customize and fine-tune the LLM to their specific applications.
arXiv Detail & Related papers (2024-05-18T17:28:29Z) - Vision-Language Models Provide Promptable Representations for Reinforcement Learning [67.40524195671479]
We propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied reinforcement learning (RL).
We show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.
arXiv Detail & Related papers (2024-02-05T00:48:56Z) - Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z) - Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment [15.180715595425864]
We introduce a novel method to improve the prompt learning of vision-language models by incorporating pre-trained large language models (LLMs).
With DuAl-PT, we propose to learn more context-aware prompts, benefiting from both explicit and implicit context modeling.
Empirically, DuAl-PT achieves superior performance on 11 downstream datasets on few-shot recognition and base-to-new generalization.
arXiv Detail & Related papers (2023-09-08T06:51:15Z) - Learning without Forgetting for Vision-Language Models [65.49600786387106]
Class-Incremental Learning (CIL), or continual learning, is a desired capability in the real world.
Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations.
We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
arXiv Detail & Related papers (2023-05-30T17:59:32Z) - Towards Versatile and Efficient Visual Knowledge Integration into Pre-trained Language Models with Cross-Modal Adapters [16.44174900423759]
We propose a new plug-and-play module, X-adapter, to leverage the aligned visual and textual knowledge learned in pre-trained vision-language models.
Our method can significantly improve the performance on object-color reasoning and natural language understanding tasks (a minimal adapter sketch follows this entry).
arXiv Detail & Related papers (2023-05-12T10:08:46Z)
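As a rough illustration of the plug-and-play adapter idea in the last entry above (a sketch under assumed dimensions and placement, not the X-adapter paper's implementation), a small bottleneck module can project frozen VLM features into a language model's hidden space and inject them residually:

```python
# Hypothetical cross-modal adapter: projects VLM features into an LM's hidden size
# and adds them to the hidden states through a learnable, zero-initialized gate.
import torch
import torch.nn as nn


class CrossModalAdapter(nn.Module):
    def __init__(self, vlm_dim: int = 512, lm_dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(vlm_dim, bottleneck)
        self.up = nn.Linear(bottleneck, lm_dim)
        self.act = nn.GELU()
        self.gate = nn.Parameter(torch.zeros(1))  # starts as identity (no injection)

    def forward(self, lm_hidden: torch.Tensor, vlm_feat: torch.Tensor) -> torch.Tensor:
        # lm_hidden: (batch, seq, lm_dim); vlm_feat: (batch, vlm_dim)
        injected = self.up(self.act(self.down(vlm_feat))).unsqueeze(1)  # (batch, 1, lm_dim)
        return lm_hidden + self.gate * injected  # broadcast over the sequence


# Toy usage: inject CLIP-like image features into a language model's hidden states.
adapter = CrossModalAdapter()
lm_hidden = torch.randn(2, 10, 768)
vlm_feat = torch.randn(2, 512)
out = adapter(lm_hidden, vlm_feat)  # same shape as lm_hidden
```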