Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training
- URL: http://arxiv.org/abs/2404.14604v3
- Date: Fri, 26 Apr 2024 02:34:29 GMT
- Title: Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training
- Authors: Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, Meng Jiang
- Abstract summary: Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs.
However, MLLMs still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro.
We propose a two-step training pipeline, VCAR, which emphasizes Visual Comprehension training in Addition to mathematical Reasoning learning.
- Score: 24.989732666940153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning with intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual comprehension due to inadequate visual-centric supervision, which leads to inaccurate interpretation of math figures. To address this issue, we propose a two-step training pipeline VCAR, which emphasizes the Visual Comprehension training in Addition to mathematical Reasoning learning. It first improves the visual comprehension ability of MLLMs through the visual description generation task, followed by another training step on generating rationales with the assistance of descriptions. Experimental results on two popular benchmarks demonstrate that VCAR substantially outperforms baseline methods solely relying on rationale supervision, especially on problems with high visual demands.
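For illustration, below is a minimal sketch of the two-step training order described in the abstract. It assumes a hypothetical MLLM fine-tuning interface; the class names, prompt templates, and data fields are placeholders, not the authors' implementation.

```python
# Minimal sketch of the two-step VCAR training order described in the abstract.
# All class names, prompt templates, and fields are hypothetical placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class MathSample:
    image_path: str    # math figure
    question: str
    description: str   # visual-description target (step 1)
    rationale: str     # intermediate reasoning target (step 2)
    answer: str


class MultimodalLM:
    """Stand-in for an open-source MLLM with a supervised fine-tuning step."""

    def fine_tune(self, images: List[str], prompts: List[str], targets: List[str]) -> None:
        # Gradient updates on (image, prompt) -> target pairs would happen here.
        pass


def train_vcar(model: MultimodalLM, data: List[MathSample]) -> None:
    # Step 1: visual comprehension training via description generation.
    model.fine_tune(
        images=[s.image_path for s in data],
        prompts=[f"{s.question}\nDescribe the figure." for s in data],
        targets=[s.description for s in data],
    )
    # Step 2: rationale generation, assisted by the visual description.
    model.fine_tune(
        images=[s.image_path for s in data],
        prompts=[f"{s.question}\nDescription: {s.description}\nSolve step by step." for s in data],
        targets=[f"{s.rationale}\nAnswer: {s.answer}" for s in data],
    )
```

The point of the sketch is only the ordering: description supervision precedes rationale supervision, so the model learns to read the figure before it learns to reason over it.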
Related papers
- Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation [19.4261670152456]
We introduce a novel task of visual solution explanation, which requires generating explanations that incorporate newly introduced visual elements essential for understanding.
We propose MathExplain, a benchmark consisting of 997 math problems annotated with visual keypoints and corresponding explanatory text that references those elements.
Our empirical results show that while some closed-source models demonstrate promising capabilities on visual solution-explaining, current open-source general-purpose models perform inconsistently.
arXiv Detail & Related papers (2025-04-04T06:03:13Z)
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages.
TVC helps the model retain attention to the visual components throughout the reasoning.
Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- Forgotten Polygons: Multimodal Large Language Models are Shape-Blind [36.051170815296985]
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving.
Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons.
We propose Visually Cued Chain-of-Thought prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams.
arXiv Detail & Related papers (2025-02-21T22:04:09Z)
- Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks.
However, they often struggle with visual arithmetic: seemingly simple tasks like object counting or length comparison.
We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z)
- Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding.
In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance.
We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z)
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- Enhancing Advanced Visual Reasoning Ability of Large Language Models [20.32900494896848]
Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning.
We propose Complex Visual Reasoning Large Language Models (CVR-LLM).
Our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop.
We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs' contextual understanding and reasoning.
arXiv Detail & Related papers (2024-09-21T02:10:19Z)
- Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning [5.9767694994869425]
Multimodal Large Language Models (MLLMs) excel in solving text-based mathematical problems.
However, they struggle with mathematical diagrams, since they are primarily trained on natural scene images.
We propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment.
arXiv Detail & Related papers (2024-08-16T10:11:05Z)
- Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address the deficiencies revealed by this evaluation, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR).
arXiv Detail & Related papers (2024-06-16T12:58:31Z)
- VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context [41.11701706312843]
We design a benchmark named VisionGraph, used to explore the capabilities of advanced LMMs in solving multimodal graph theory problems.
We present a Description-Program-Reasoning (DPR) chain to enhance the logical accuracy of reasoning processes.
Our study shows that GPT-4V outperforms Gemini Pro in multi-step graph reasoning.
arXiv Detail & Related papers (2024-05-08T10:42:48Z)
- Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
- MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs).
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)