Multimodal Chain-of-Thought Reasoning in Language Models
- URL: http://arxiv.org/abs/2302.00923v5
- Date: Mon, 20 May 2024 06:43:48 GMT
- Title: Multimodal Chain-of-Thought Reasoning in Language Models
- Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola,
- Abstract summary: We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
- Score: 94.70184390935661
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.
Related papers
- Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision [20.953299102154215]
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs)<n>We propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning.
arXiv Detail & Related papers (2025-08-07T17:45:17Z) - Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation [22.722731231389393]
Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) Large Language Models (LLMs)<n>We propose a Multimodal Chain-of-Student Reasoning Distillation model, MulCoT-RD, to address deployment constraints in resource-limited environments.<n>Experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.
arXiv Detail & Related papers (2025-08-07T10:23:14Z) - Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images.<n>Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent.<n> Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z) - MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO [87.52631406241456]
Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks.<n>We introduce Mind Omni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning.
arXiv Detail & Related papers (2025-05-19T12:17:04Z) - Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning [20.562109430526007]
Chain-of-Thought (CoT) reasoning has proven effective in natural language tasks but remains underexplored in multimodal alignment.
This study investigates its integration into 3D vision-hugging learning by embedding structured reasoning into alignment training.
arXiv Detail & Related papers (2025-03-08T14:24:54Z) - URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [25.308196207219613]
Chain-of-Thought (CoT) reasoning is widely used to enhance the mathematical reasoning capabilities of large language models (LLMs)
In this work, we propose a novel framework that introduces System 2-style thinking to multimodal mathematical reasoning.
arXiv Detail & Related papers (2025-01-08T18:49:41Z) - Vision-Language Models Can Self-Improve Reasoning via Reflection [20.196406628954303]
Chain-of-thought (CoT) has proven to improve the reasoning capability of large language models (LLMs)
We propose a self-training framework, R3V, which iteratively enhances the model's Vision-language Reasoning by Reflecting on CoT Rationales.
Our approach supports self-reflection on generated solutions, further boosting performance through test-time computation.
arXiv Detail & Related papers (2024-10-30T14:45:00Z) - Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs [63.36637269634553]
We introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step.<n>We show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales.<n>Our work is also significant because both quantitative analyses and manual evaluations reveal the observed gains stem from the models' ability to refine an initial reasoning chain.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning)
The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs.
It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Enhancing Chain of Thought Prompting in Large Language Models via Reasoning Patterns [26.641713417293538]
Chain of Thought (CoT) prompting can encourage language models to engage in logical reasoning.
We propose leveraging reasoning patterns to enhance CoT prompting effectiveness.
arXiv Detail & Related papers (2024-04-23T07:50:00Z) - ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs)
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z) - Chain-of-Thought Prompt Distillation for Multimodal Named Entity
Recognition and Multimodal Relation Extraction [8.169359626365619]
We generate a textitchain of thought (CoT) -- a sequence of intermediate reasoning steps.
We present a novel conditional prompt distillation method to assimilate the commonsense reasoning ability from large language models.
Our approach attains state-of-the-art accuracy and manifests a plethora of advantages concerning interpretability, data efficiency, and cross-domain generalization.
arXiv Detail & Related papers (2023-06-25T04:33:56Z) - Automatic Model Selection with Large Language Models for Reasoning [33.93807127935167]
Chain-of-Thought (CoT) and Program-Aided Language Models (PAL) represent two distinct reasoning methods.
We introduce a model selection method to combine the best of both worlds by employing a large language model.
Our proposed method demonstrates significant performance improvements across eight reasoning datasets.
arXiv Detail & Related papers (2023-05-23T17:57:59Z) - T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large
Language Model Signals for Science Question Answering [59.63860993280275]
Large Language Models (LLMs) have demonstrated exceptional performance in various Natural Language Processing (NLP) tasks.
We propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals.
Our approach achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%.
arXiv Detail & Related papers (2023-05-05T11:56:30Z) - Learn to Explain: Multimodal Reasoning via Thought Chains for Science
Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that SQA improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.