Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT
- URL: http://arxiv.org/abs/2305.00201v1
- Date: Sat, 29 Apr 2023 08:59:12 GMT
- Title: Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT
- Authors: Zhenxiang Xiao, Yuzhong Chen, Lu Zhang, Junjie Yao, Zihao Wu, Xiaowei
Yu, Yi Pan, Lin Zhao, Chong Ma, Xinyu Liu, Wei Liu, Xiang Li, Yixuan Yuan,
Dinggang Shen, Dajiang Zhu, Tianming Liu, Xi Jiang
- Abstract summary: In this paper, we focus on adapting prompt design based on instruction tuning into a visual transformer model for image classification.
The key idea is to implement multi-modal prompts related to category information to guide the fine-tuning of the model.
Experiments on several image captioning tasks show improved performance and domain adaptability.
- Score: 58.70209492842953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompts have been proven to play a crucial role in large language models, and
in recent years, vision models have also been using prompts to improve
scalability for multiple downstream tasks. In this paper, we focus on adapting
prompt design based on instruction tuning into a visual transformer model for
image classification, which we call Instruction-ViT. The key idea is to
implement multi-modal prompts (text or image prompts) related to category
information to guide the fine-tuning of the model. Experiments on several
image captioning tasks show improved performance and domain adaptability. Our
work provides an innovative strategy for fusing multi-modal prompts, yielding
better performance and faster adaptability for visual classification models.
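The abstract does not come with code, but the core idea — appending category-related text or image prompt tokens to a ViT's input sequence and fine-tuning for classification — can be sketched as follows. This is a minimal, hypothetical PyTorch sketch; the module name `InstructionViT`, the concatenation-based fusion, and the linear classification head are assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the authors' code): a ViT-style encoder
# whose input sequence is [CLS] + patch tokens + multi-modal prompt tokens
# derived from category information (text and/or image prompts).
import torch
import torch.nn as nn


class InstructionViT(nn.Module):  # hypothetical name
    def __init__(self, num_classes=10, dim=768, depth=6, patches=196):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)          # stand-in patch embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_feats, prompt_tokens):
        # patch_feats: (B, N, 768) pre-extracted patch features, with N == patches
        # prompt_tokens: (B, P, dim) text/image prompt embeddings tied to categories
        x = self.patch_proj(patch_feats)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = torch.cat([x, prompt_tokens], dim=1)       # append multi-modal prompts
        x = self.encoder(x)
        return self.head(x[:, 0])                      # classify from the [CLS] token
```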
Related papers
- Multi-Modal Adapter for Vision-Language Models [5.040884755454258]
We propose Multi-Modal Adapter, an approach for Multi-Modal adaptation of CLIP.
We add a trainable Multi-Head Attention layer that combines text and image features to produce an additive adaptation of both.
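A rough sketch of the additive adaptation described above, assuming frozen CLIP features of matching width; the module name `MultiModalAdapter`, the residual scale, and the tensor shapes are guesses rather than the paper's actual design:

```python
# Rough sketch (assumed shapes and names): a trainable multi-head attention
# block that mixes text and image features and adds the result back to both,
# leaving the CLIP backbone frozen.
import torch
import torch.nn as nn


class MultiModalAdapter(nn.Module):  # hypothetical module name
    def __init__(self, dim=512, heads=8, scale=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scale = scale

    def forward(self, text_feats, image_feats):
        # text_feats: (B, C, dim) one feature per class prompt
        # image_feats: (B, 1, dim) pooled image feature
        joint = torch.cat([text_feats, image_feats], dim=1)
        mixed, _ = self.attn(joint, joint, joint)      # cross-modal mixing
        adapted = joint + self.scale * mixed           # additive adaptation
        return adapted[:, :-1], adapted[:, -1:]        # updated text, image features
```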
arXiv Detail & Related papers (2024-09-03T12:47:08Z)
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
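For illustration only, a visual prompt in this setting can be thought of as a learnable image-space perturbation that is trained against one frozen model and then reused with another; the sketch below assumes that form and omits TVP's actual transferability objectives:

```python
# Illustrative sketch only (assumed form): a visual prompt as a learnable
# additive pattern on the input image, trained with one frozen model and then
# reused, unchanged, with a different model.
import torch
import torch.nn as nn


class VisualPrompt(nn.Module):  # hypothetical
    def __init__(self, image_size=224):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, images):
        # images: (B, 3, H, W) in [0, 1]
        return (images + self.delta).clamp(0.0, 1.0)   # prompt-augmented input
```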
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
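A hypothetical sketch of the focus-then-answer step: the region of interest named by the model is cropped and resized back to the original input resolution, and both views are passed to the next reasoning round. The function name and box format are assumptions:

```python
# Hypothetical sketch of "focus on a region, then answer": crop the region of
# interest, resize it to the original input resolution, and hand both the
# global and zoomed views to the next round.
import torch
import torch.nn.functional as F


def chain_of_spot_step(image, box):
    # image: (B, 3, H, W); box: (x0, y0, x1, y1) pixel coordinates of the ROI
    x0, y0, x1, y1 = box
    roi = image[:, :, y0:y1, x0:x1]
    roi = F.interpolate(roi, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)           # keep original resolution
    return torch.cat([image, roi], dim=0)              # global + zoomed views
```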
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models [12.397136690734865]
We propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT.
MuDPT extends independent multi-modal prompt tuning by learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion.
Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin.
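A loose sketch of bi-directional prompt fusion under assumed shapes: per-layer text and visual prompts exchanged through a small shared mapping network. The class name, dimensions, and additive fusion rule are guesses, not MuDPT's exact formulation:

```python
# Loose sketch (assumed form): at each depth, shared linear maps project text
# prompts into the visual prompt space and back, and each side is updated with
# the other's view before being injected into its encoder layer.
import torch
import torch.nn as nn


class BiDirectionalPromptFusion(nn.Module):  # hypothetical
    def __init__(self, text_dim=512, vis_dim=768, n_prompts=4, depth=12):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(depth, n_prompts, text_dim))
        self.vis_prompts = nn.Parameter(torch.randn(depth, n_prompts, vis_dim))
        self.t2v = nn.Linear(text_dim, vis_dim)
        self.v2t = nn.Linear(vis_dim, text_dim)

    def forward(self, layer):
        t, v = self.text_prompts[layer], self.vis_prompts[layer]
        t_fused = t + self.v2t(v)                      # visual -> text direction
        v_fused = v + self.t2v(t)                      # text -> visual direction
        return t_fused, v_fused                        # prompts injected at this layer
```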
arXiv Detail & Related papers (2023-06-20T09:15:52Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
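A sketch of the auxiliary supervision this suggests, assuming a simple contrastive-style alignment between pooled image features and frozen text embeddings; the loss form and temperature are assumptions rather than the paper's exact objective:

```python
# Sketch under assumptions: descriptions are encoded by a frozen text encoder,
# and the vision model is trained with an auxiliary term pulling its pooled
# features toward the matching text embedding.
import torch
import torch.nn.functional as F


def alignment_loss(image_feats, text_embeds, temperature=0.07):
    # image_feats, text_embeds: (B, D), rows are paired image/description
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    logits = img @ txt.t() / temperature               # cosine-similarity logits
    targets = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, targets)            # contrastive-style matching
```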
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
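One way a class-aware text prompt could look, sketched under assumptions (the module name, the additive image-conditioning, and the shapes are hypothetical):

```python
# Hypothetical sketch: learnable context tokens are shifted by image
# information before being paired with each class-name embedding, giving a
# "class-aware", image-conditioned text prompt.
import torch
import torch.nn as nn


class ClassAwareTextPrompt(nn.Module):  # hypothetical name
    def __init__(self, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim))   # learnable context tokens
        self.img_proj = nn.Linear(dim, dim)

    def forward(self, image_feat, class_name_embeds):
        # image_feat: (dim,); class_name_embeds: (C, dim), one embedding per label
        ctx = self.ctx + self.img_proj(image_feat)          # inject image information
        ctx = ctx.unsqueeze(0).expand(class_name_embeds.size(0), -1, -1)
        names = class_name_embeds.unsqueeze(1)              # (C, 1, dim)
        return torch.cat([ctx, names], dim=1)               # (C, n_ctx + 1, dim)
```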
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
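A rough sketch of the "tiny network" idea under assumptions: one shared set of learnable tokens is passed through a lightweight self-attention layer and split into text-side and vision-side prompts. The module name and dimensions are hypothetical:

```python
# Rough sketch (assumed form): shared learnable tokens are mixed by a small
# self-attention block, then split into prompts for the text encoder and the
# vision encoder so both are optimized jointly.
import torch
import torch.nn as nn


class UnifiedPrompt(nn.Module):  # hypothetical
    def __init__(self, n_prompts=8, dim=512, vis_dim=768):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(1, 2 * n_prompts, dim))
        self.mix = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.to_vis = nn.Linear(dim, vis_dim)
        self.n = n_prompts

    def forward(self):
        p = self.mix(self.shared)                      # jointly transform all tokens
        text_prompt = p[:, : self.n]                   # (1, n, dim) for the text encoder
        visual_prompt = self.to_vis(p[:, self.n:])     # (1, n, vis_dim) for the ViT
        return text_prompt, visual_prompt
```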
arXiv Detail & Related papers (2022-10-13T17:50:24Z)
- Visual Prompt Tuning for Generative Transfer Learning [26.895321693202284]
We present a recipe for learning vision transformers by generative knowledge transfer.
We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens fed to autoregressive or non-autoregressive transformers.
To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens, called a prompt, to the image token sequence.
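A minimal sketch of prompt tuning in this generative setting, assuming the prompt is simply prepended to the embedded visual-token sequence of a frozen generator; names and shapes are hypothetical:

```python
# Minimal sketch (assumed interface): learnable prompt tokens prepended to the
# visual-token sequence before it is fed to a frozen generative transformer.
import torch
import torch.nn as nn


class GenerativePromptTuning(nn.Module):  # hypothetical
    def __init__(self, n_prompts=16, dim=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(1, n_prompts, dim))

    def forward(self, image_token_embeds):
        # image_token_embeds: (B, L, dim) embeddings of discrete visual tokens
        prompt = self.prompt.expand(image_token_embeds.size(0), -1, -1)
        return torch.cat([prompt, image_token_embeds], dim=1)  # prompt-prefixed sequence
```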
arXiv Detail & Related papers (2022-10-03T14:56:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.