Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2405.13872v2
- Date: Wed, 29 May 2024 02:24:36 GMT
- Title: Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models
- Authors: Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, Yue Zhang,
- Abstract summary: Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of Large Language Models (LLMs) in complex reasoning tasks.
We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs to extract visual rationales step-by-step.
IoT prompting has improved zero-shot visual reasoning performance across various visual understanding tasks in different MLLMs.
- Score: 14.765057045747753
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of Large Language Models (LLMs) in complex reasoning tasks. With the evolution of Multimodal Large Language Models (MLLMs), enhancing their capability to tackle complex multimodal reasoning problems is a crucial frontier. However, incorporating multimodal rationales in CoT has yet to be thoroughly investigated. We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs to extract visual rationales step-by-step. Specifically, IoT prompting can automatically design critical visual information extraction operations based on the input images and questions. Each step of visual information refinement identifies specific visual rationales that support answers to complex visual reasoning questions. Beyond the textual CoT, IoT simultaneously utilizes visual and textual rationales to help MLLMs understand complex multimodal information. IoT prompting has improved zero-shot visual reasoning performance across various visual understanding tasks in different MLLMs. Moreover, the step-by-step visual feature explanations generated by IoT prompting elucidate the visual reasoning process, aiding in analyzing the cognitive processes of large multimodal models
Related papers
- X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM.
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z) - POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models [28.072184039405784]
We present POEM, a visual analytics system to facilitate efficient prompt engineering for large language models (LLMs)
The system enables users to explore the interaction patterns across modalities at varying levels of detail for a comprehensive understanding of the multimodal knowledge elicited by various prompts.
arXiv Detail & Related papers (2024-06-06T08:21:30Z) - MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning [3.651416979200174]
MMCTAgent is a novel critical thinking agent framework designed to address the inherent limitations of current MLLMs in complex visual reasoning tasks.
Inspired by human cognitive processes and critical thinking, MMCTAgent iteratively analyzes multi-modal information, decomposes queries, plans strategies, and dynamically evolves its reasoning.
arXiv Detail & Related papers (2024-05-28T16:55:41Z) - Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models [42.182009352159]
We present a new efficient LLVM, Mamba-based traversal of rationales (Meteor)
To embed lengthy rationales containing abundant information, we employ the Mamba architecture, capable of processing sequential data with linear time complexity.
Subsequently, the backbone multimodal language model (MLM) is trained to generate answers with the aid of rationale.
arXiv Detail & Related papers (2024-05-24T14:04:03Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in
Language Models [25.058162782167503]
Chain-of-thought (CoT) reasoning has exhibited impressive performance in language models for solving complex tasks and answering questions.
We introduce a novel approach for multi-modal CoT reasoning that utilizes latent space learning via diffusion processes to generate effective image features that align with language thoughts.
Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT.
arXiv Detail & Related papers (2023-12-14T09:13:09Z) - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [96.33509740612486]
MM-REACT is a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
MM-REACT's prompt design allows language models to accept, associate, and process multimodal information.
arXiv Detail & Related papers (2023-03-20T18:31:47Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.