Related papers: Multimodal Chain-of-Thought Reasoning in Language Models

Multimodal Chain-of-Thought Reasoning in Language Models

URL: http://arxiv.org/abs/2302.00923v5
Date: Mon, 20 May 2024 06:43:48 GMT
Title: Multimodal Chain-of-Thought Reasoning in Language Models
Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola,
Abstract summary: We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
Score: 94.70184390935661
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.

Related papers

Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision [20.953299102154215]
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs)<n>We propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning.
arXiv Detail & Related papers (2025-08-07T17:45:17Z)
Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation [22.722731231389393]
Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) Large Language Models (LLMs)<n>We propose a Multimodal Chain-of-Student Reasoning Distillation model, MulCoT-RD, to address deployment constraints in resource-limited environments.<n>Experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.
arXiv Detail & Related papers (2025-08-07T10:23:14Z)
Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images.<n>Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent.<n> Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z)
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO [87.52631406241456]
Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks.<n>We introduce Mind Omni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning.
arXiv Detail & Related papers (2025-05-19T12:17:04Z)
Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning [20.562109430526007]
Chain-of-Thought (CoT) reasoning has proven effective in natural language tasks but remains underexplored in multimodal alignment. This study investigates its integration into 3D vision-hugging learning by embedding structured reasoning into alignment training.
arXiv Detail & Related papers (2025-03-08T14:24:54Z)
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [25.308196207219613]
Chain-of-Thought (CoT) reasoning is widely used to enhance the mathematical reasoning capabilities of large language models (LLMs) In this work, we propose a novel framework that introduces System 2-style thinking to multimodal mathematical reasoning.
arXiv Detail & Related papers (2025-01-08T18:49:41Z)
Vision-Language Models Can Self-Improve Reasoning via Reflection [20.196406628954303]
Chain-of-thought (CoT) has proven to improve the reasoning capability of large language models (LLMs) We propose a self-training framework, R3V, which iteratively enhances the model's Vision-language Reasoning by Reflecting on CoT Rationales. Our approach supports self-reflection on generated solutions, further boosting performance through test-time computation.
arXiv Detail & Related papers (2024-10-30T14:45:00Z)
Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs [63.36637269634553]
We introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step.<n>We show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales.<n>Our work is also significant because both quantitative analyses and manual evaluations reveal the observed gains stem from the models' ability to refine an initial reasoning chain.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning) The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs. It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z)
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
Enhancing Chain of Thought Prompting in Large Language Models via Reasoning Patterns [26.641713417293538]
Chain of Thought (CoT) prompting can encourage language models to engage in logical reasoning. We propose leveraging reasoning patterns to enhance CoT prompting effectiveness.
arXiv Detail & Related papers (2024-04-23T07:50:00Z)
ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs) Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts. We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z)
Chain-of-Thought Prompt Distillation for Multimodal Named Entity Recognition and Multimodal Relation Extraction [8.169359626365619]
We generate a textitchain of thought (CoT) -- a sequence of intermediate reasoning steps. We present a novel conditional prompt distillation method to assimilate the commonsense reasoning ability from large language models. Our approach attains state-of-the-art accuracy and manifests a plethora of advantages concerning interpretability, data efficiency, and cross-domain generalization.
arXiv Detail & Related papers (2023-06-25T04:33:56Z)
Automatic Model Selection with Large Language Models for Reasoning [33.93807127935167]
Chain-of-Thought (CoT) and Program-Aided Language Models (PAL) represent two distinct reasoning methods. We introduce a model selection method to combine the best of both worlds by employing a large language model. Our proposed method demonstrates significant performance improvements across eight reasoning datasets.
arXiv Detail & Related papers (2023-05-23T17:57:59Z)
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering [59.63860993280275]
Large Language Models (LLMs) have demonstrated exceptional performance in various Natural Language Processing (NLP) tasks. We propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals. Our approach achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%.
arXiv Detail & Related papers (2023-05-05T11:56:30Z)
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We show that SQA improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.