Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework
- URL: http://arxiv.org/abs/2511.20701v1
- Date: Mon, 24 Nov 2025 16:20:02 GMT
- Title: Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework
- Authors: Nitya Tiwari, Parv Maheshwari, Vidisha Agarwal
- Abstract summary: This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning. We evaluate its effectiveness on the A-OKVQA, OKVQA, and ChartQA datasets. Our findings reveal that while vision integration significantly reduces hallucination in rationale generation, the effectiveness of CoT reasoning varies substantially across question types.
- Score: 1.7842332554022695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent work has extended CoT to multimodal settings, achieving state-of-the-art results on science question answering benchmarks like ScienceQA, the generalizability of these approaches across diverse domains remains underexplored. This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning, evaluating its effectiveness on the A-OKVQA, OKVQA, and ChartQA datasets, which require broad commonsense and world knowledge beyond scientific reasoning. We implement the two-stage framework proposed by Zhang et al. [3], which separates rationale generation from answer inference and integrates vision features through a gated fusion mechanism with T5-based language models. Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices. Our findings reveal that while vision integration significantly reduces hallucination in rationale generation, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning presenting particular challenges. This work provides practical insights for researchers implementing multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization.
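The gated fusion mechanism mentioned in the abstract can be illustrated with a minimal PyTorch sketch. It follows the Multimodal-CoT formulation of Zhang et al.: text-conditioned attention first aligns projected vision features to the text sequence, then a learned sigmoid gate λ mixes the two streams as (1 − λ)·H_text + λ·H_vision. Module names, dimensions, and the single-head attention variant here are illustrative assumptions, not details taken from the paper under discussion.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Sketch of gated fusion of T5 encoder states with vision features.

    h_text:   (batch, seq_len, d_model)  text token states from a T5 encoder
    h_vision: (batch, n_patches, d_vision)  image features (e.g. from a ViT)
    """

    def __init__(self, d_model: int, d_vision: int):
        super().__init__()
        self.vision_proj = nn.Linear(d_vision, d_model)  # map vision into text space
        self.gate_text = nn.Linear(d_model, d_model)
        self.gate_vision = nn.Linear(d_model, d_model)

    def forward(self, h_text: torch.Tensor, h_vision: torch.Tensor) -> torch.Tensor:
        v = self.vision_proj(h_vision)                            # (B, P, D)
        # Align vision features to the text sequence with single-head attention,
        # using the text states as queries over the projected patches.
        scores = h_text @ v.transpose(1, 2) / (h_text.size(-1) ** 0.5)
        v_aligned = torch.softmax(scores, dim=-1) @ v             # (B, T, D)
        # Learned gate decides, per position and channel, how much vision to mix in.
        lam = torch.sigmoid(self.gate_text(h_text) + self.gate_vision(v_aligned))
        return (1 - lam) * h_text + lam * v_aligned               # (B, T, D)
```

In the two-stage setup, the fused states would feed the T5 decoder twice: once to generate a rationale, and again (with the rationale appended to the input) to infer the final answer.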
Related papers
- From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models [36.54062692717823]
Chain-of-Thought (CoT) reasoning has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT).
arXiv Detail & Related papers (2025-11-17T01:22:37Z)
- Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges [54.669838624278924]
Foundation models have transformed natural language processing and computer vision. With powerful generalization and transfer learning capabilities, they align naturally with the multimodal, multi-resolution, and multi-temporal characteristics of remote sensing data. This survey delivers a comprehensive review of multimodal GFMs from a modality-driven perspective.
arXiv Detail & Related papers (2025-10-27T03:40:00Z)
- Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models [0.0]
This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. We find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors.
arXiv Detail & Related papers (2025-08-06T13:14:20Z)
- Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models [79.52467430114805]
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities.
arXiv Detail & Related papers (2025-05-08T03:35:23Z)
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey [124.23247710880008]
Multimodal CoT (MCoT) reasoning has recently garnered significant research attention. Existing MCoT studies design various methodologies to address the challenges of image, video, speech, audio, 3D, and structured data. We present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions.
arXiv Detail & Related papers (2025-03-16T18:39:13Z)
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
- Multimodal Alignment and Fusion: A Survey [11.3029945633295]
This survey provides a comprehensive overview of advances in multimodal alignment and fusion within the field of machine learning. We systematically categorize and analyze key approaches to alignment and fusion from structural perspectives. This survey highlights critical challenges such as cross-modal misalignment, computational bottlenecks, data quality issues, and the modality gap.
arXiv Detail & Related papers (2024-11-26T02:10:27Z)
- Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
- Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework [51.44863255495668]
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence.
We present the Multi-Modal Reasoning (COCO-MMR) dataset, a novel dataset that encompasses an extensive collection of open-ended questions.
We propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders.
arXiv Detail & Related papers (2023-07-24T08:58:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.