Chain of Thought Prompt Tuning in Vision Language Models
- URL: http://arxiv.org/abs/2304.07919v2
- Date: Sat, 17 Jun 2023 06:40:27 GMT
- Title: Chain of Thought Prompt Tuning in Vision Language Models
- Authors: Jiaxin Ge, Hongyin Luo, Siyuan Qian, Yulu Gan, Jie Fu, Shanghang Zhang
- Abstract summary: We propose a novel chain of thought prompt tuning for vision-language modeling.
We are the first to successfully adapt chain-of-thought prompting so that it combines visual and textual embeddings.
- Score: 29.85907584680661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-Image Pre-training has demonstrated promising results on zero-shot
and few-shot downstream tasks by prompting visual models with natural language
prompts. However, most recent studies only use a single prompt for tuning,
neglecting the inherent step-by-step cognitive reasoning process that humans
conduct in complex task settings, for example, when processing images from
unfamiliar domains. Chain of Thought is a simple and effective approximation of
the human reasoning process and has proven useful for natural language
processing (NLP) tasks. Based on this cognitive intuition, we believe that
conducting effective reasoning is also an important problem in visual tasks,
and a chain of thought could be a solution to this problem. In this work, we
propose a novel chain of thought prompt tuning for vision-language modeling.
Extensive experiments show that our method not only generalizes better on image
classification tasks, transfers better beyond a single dataset, and achieves
stronger domain generalization, but also performs much better on image-text
retrieval and visual question answering, which require more reasoning
capability. We are the first to successfully adapt chain-of-thought prompting
so that it combines visual and textual embeddings. We will release our code.
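The abstract describes tuning a chain of prompts rather than a single prompt, with visual and textual embeddings combined at each step. Since the released code is not shown here, the snippet below is only a minimal sketch of what such a prompt chain could look like over a frozen CLIP-style backbone; the module names, chain length, and fusion layers are assumptions, not the authors' implementation.

```python
# Minimal sketch of chain-of-thought prompt tuning over a frozen CLIP-style
# backbone. Dimensions, chain length, and fusion layers are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512   # shared image/text embedding width (assumed)
CHAIN_LEN = 3     # number of reasoning steps in the prompt chain (assumed)


class ChainOfThoughtPrompts(nn.Module):
    """Learnable prompt chain: each step refines a running prompt state
    using the (frozen) image embedding, mimicking step-by-step reasoning."""

    def __init__(self):
        super().__init__()
        # One learnable context vector per chain step; only these and the
        # fusion layers are trained, the backbone stays frozen.
        self.ctx = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(EMBED_DIM)) for _ in range(CHAIN_LEN)]
        )
        self.fuse = nn.ModuleList(
            [nn.Linear(2 * EMBED_DIM, EMBED_DIM) for _ in range(CHAIN_LEN)]
        )

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (batch, EMBED_DIM) from a frozen image encoder.
        state = torch.zeros_like(image_emb)
        for ctx, fuse in zip(self.ctx, self.fuse):
            step = state + ctx                       # inject this step's prompt
            state = torch.tanh(fuse(torch.cat([step, image_emb], dim=-1)))
        return state                                 # final chained prompt context


def classify(image_emb, class_text_emb, prompt_chain):
    """Cosine-similarity logits with the chained context added to each class
    text embedding (schematic head, not necessarily the paper's)."""
    ctx = prompt_chain(image_emb)                                    # (B, D)
    text = F.normalize(class_text_emb.unsqueeze(0) + ctx.unsqueeze(1), dim=-1)
    img = F.normalize(image_emb, dim=-1).unsqueeze(1)                # (B, 1, D)
    return (img * text).sum(-1)                                      # (B, C)
```

Only the context vectors and fusion layers would be tuned here; the image and text encoders stay frozen, which is the usual prompt-tuning setup.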
Related papers
- Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities [30.96613796974929]
We introduce a simple method to unlock the visual reasoning capabilities of multimodal large language models.
Whiteboard-of-thought prompting provides models with a metaphorical 'whiteboard' to draw out reasoning steps as images.
This simple approach shows state-of-the-art results on four difficult natural language tasks.
arXiv Detail & Related papers (2024-06-20T17:59:45Z)
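The whiteboard idea above lends itself to a short sketch: the model writes drawing code, the code is executed to render an image, and the image is handed back for the final answer. The `query_mllm` helper below is a hypothetical stand-in for whatever multimodal model API is used; this is not the paper's released implementation.

```python
# Sketch of a whiteboard-style reasoning loop: the model writes plotting code,
# the code is executed to render an image, and the image is passed back to the
# model for the final answer. `query_mllm` is a hypothetical placeholder.
import io
import matplotlib
matplotlib.use("Agg")                     # render off-screen
import matplotlib.pyplot as plt


def query_mllm(prompt: str, image_bytes: bytes | None = None) -> str:
    """Placeholder for a call to a multimodal LLM (e.g. via an API client)."""
    raise NotImplementedError


def whiteboard_of_thought(question: str) -> str:
    # 1. Ask the model to externalize its reasoning as drawing code.
    code = query_mllm(
        "Write matplotlib code that draws your intermediate reasoning "
        f"for the following question on a whiteboard image:\n{question}"
    )
    # 2. Execute the generated code in a scratch namespace (sandbox in practice).
    namespace = {"plt": plt}
    exec(code, namespace)                 # illustrative only; sandbox for real use
    buf = io.BytesIO()
    plt.savefig(buf, format="png")
    plt.close("all")
    # 3. Return the rendered whiteboard to the model and ask for the answer.
    return query_mllm(f"Using the attached whiteboard, answer: {question}",
                      image_bytes=buf.getvalue())
```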
- Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model [25.47573567479831]
We propose a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques.
Our method is out-of-the-box and does not require fine-tuning or optimization.
arXiv Detail & Related papers (2024-05-16T17:59:21Z)
- Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning [45.517215214938844]
The chain-of-thought technique has been well received in multi-modal tasks.
We propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning.
arXiv Detail & Related papers (2024-04-06T07:39:44Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods, such as typos and word-order shuffling, which resonate with human cognitive patterns and allow the perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
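The perturbations named in the Pixel Sentence Representation Learning entry (typos and word-order shuffling) are simple to illustrate. The sketch below shows plausible string-level versions; the paper additionally renders sentences as pixels, which is omitted here.

```python
# Illustrative sketch of visually-grounded text perturbations (typos and
# word-order shuffling). The pixel-rendering step used in the paper is omitted.
import random


def typo_perturb(sentence: str, rate: float = 0.1) -> str:
    """Swap adjacent characters inside random words to simulate typos."""
    words = sentence.split()
    for i, w in enumerate(words):
        if len(w) > 3 and random.random() < rate:
            j = random.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def shuffle_perturb(sentence: str, window: int = 3) -> str:
    """Locally shuffle word order within a small window."""
    words = sentence.split()
    out = []
    for k in range(0, len(words), window):
        chunk = words[k:k + window]
        random.shuffle(chunk)
        out.extend(chunk)
    return " ".join(out)


if __name__ == "__main__":
    s = "pixel sentence representations treat text as images"
    print(typo_perturb(s))
    print(shuffle_perturb(s))
```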
- Chain of Images for Intuitively Reasoning [23.692458865558486]
We present a Chain of Images (CoI) approach that converts complex language reasoning problems into simple pattern recognition.
We have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving.
In supporting our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions.
arXiv Detail & Related papers (2023-11-09T11:14:51Z)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples.
It is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks.
We present Painter, a generalist model which redefines the output of core vision tasks as images and specifies task prompts also as images.
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
- Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango [11.344587937052697]
This work initiates the preliminary steps towards a deeper understanding of reasoning mechanisms in large language models.
Our work centers around querying the model while controlling for all but one of the components in a prompt: symbols, patterns, and text.
We posit that text imbues patterns with commonsense knowledge and meaning.
arXiv Detail & Related papers (2022-09-16T02:54:00Z)
- Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels to learn a fixed set of weights, seen as visual concepts, to aligning images and raw text with two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
arXiv Detail & Related papers (2021-09-02T17:57:31Z)
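The "Learning to Prompt" entry above describes the dual-encoder paradigm that prompt tuning builds on: images and natural-language prompts are embedded by separate encoders and matched by similarity. The sketch below shows that zero-shot setup using the Hugging Face CLIP wrappers with a hand-written template; CoOp's contribution is to replace that template with learnable context vectors, which is not reproduced here.

```python
# Zero-shot classification with two separate encoders: images and natural-
# language prompts are embedded independently and matched by similarity.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {c}" for c in classes]   # hand-written prompt template
image = Image.new("RGB", (224, 224))               # stand-in for a real image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image          # (1, num_classes)
print(logits.softmax(dim=-1))
```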
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
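The last entry links multi-task pre-training to model-agnostic meta-learning. The toy sketch below shows the inner/outer loop structure that connection refers to, using synthetic tasks and a single parameter vector; it is illustrative only and not the paper's algorithm.

```python
# MAML-style toy loop: an inner step adapts shared parameters to each task,
# and the outer step updates the shared initialization through that adaptation.
import torch

torch.manual_seed(0)
theta = torch.zeros(8, requires_grad=True)         # shared "pre-trained" parameters
outer_opt = torch.optim.SGD([theta], lr=0.1)
inner_lr = 0.05


def task_loss(params, target):
    """Toy regression task: match a task-specific target vector."""
    return ((params - target) ** 2).mean()


for step in range(100):
    outer_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                              # a batch of sampled tasks
        target = torch.randn(8)
        # Inner (meta-train) step: one gradient step from the shared init.
        grad = torch.autograd.grad(task_loss(theta, target), theta, create_graph=True)[0]
        adapted = theta - inner_lr * grad
        # Outer (meta-test) loss: how well the adapted parameters do.
        meta_loss = meta_loss + task_loss(adapted, target)
    meta_loss.backward()                            # backprop through the inner step
    outer_opt.step()
```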