Related papers: Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

URL: http://arxiv.org/abs/2408.08640v2
Date: Wed, 25 Sep 2024 09:53:13 GMT
Title: Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
Authors: Wenwen Zhuang, Xin Huang, Xiantao Zhang, Jin Zeng,
Abstract summary: Multimodal Large Language Models (MLLMs) excel in solving text-based mathematical problems. They struggle with mathematical diagrams since they are primarily trained on natural scene images. We propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment.
Score: 5.9767694994869425
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) excel in solving text-based mathematical problems, but they struggle with mathematical diagrams since they are primarily trained on natural scene images. For humans, visual aids generally enhance problem-solving, but MLLMs perform worse as information shifts from textual to visual modality. This decline is mainly due to their shortcomings in aligning images and text. To tackle aforementioned challenges, we propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment. This approach is designed to improve the mathematical reasoning skills of MLLMs through a three-stage training process, with the second stage being the critical alignment stage. We first enhance the language model's mathematical reasoning capabilities with extensive set of textual mathematical problems. We then construct a multimodal dataset with varying degrees of textual and visual information, creating data pairs by presenting each problem in at least two forms. By leveraging the Kullback-Leibler (KL) divergence of next-token prediction distributions to align visual and textual modalities, consistent problem-solving abilities are ensured. Finally, we utilize multimodal instruction tuning for MLLMs with high-quality multimodal data. Experimental results on multiple mathematical reasoning benchmarks demonstrate that the MLLMs trained with Math-PUMA surpass most open-source MLLMs. Our approach effectively narrows the performance gap for problems presented in different modalities. The code and data are available at: \url{https://github.com/wwzhuang01/Math-PUMA}.

Related papers

CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images [69.93976232543066]
We propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics.<n>To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning.<n>Our model achieves up to 21% increase over base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm.
arXiv Detail & Related papers (2025-10-13T17:59:55Z)
MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning [57.42710816140401]
A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps.<n>Existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM's ability to perform accurate visual operations via code largely unexplored.<n>This work takes a first step toward addressing that gap by evaluating MLLM's code-based capabilities in multi-modal mathematical reasoning.
arXiv Detail & Related papers (2025-07-24T07:03:11Z)
MLLMs are Deeply Affected by Modality Bias [158.64371871084478]
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images.<n>MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs.<n>This paper argues that MLLMs are deeply affected by modality bias, highlighting its manifestations across various tasks.
arXiv Detail & Related papers (2025-05-24T11:49:31Z)
MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection [53.325457460187046]
We introduce MathAgent, a novel Mixture-of-Math-Agent framework designed specifically to address these challenges. MathAgent decomposes error detection into three phases, each handled by a specialized agent. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification.
arXiv Detail & Related papers (2025-03-23T16:25:08Z)
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts [34.972503583614674]
We introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios, and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels. We observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH.
arXiv Detail & Related papers (2025-02-28T07:50:36Z)
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model [37.26146689342965]
Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning. MLLMs tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. We aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision.
arXiv Detail & Related papers (2024-09-10T01:20:22Z)
MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models [14.274813480249161]
We introduce textbfMultiMath-7B, a large language model that bridges the gap between math and vision. textbfMultiMath-7B is trained through a four-stage process, focusing on vision-language alignment, visual and math instruction-tuning, and process-supervised reinforcement learning. We also construct a novel, diverse and comprehensive multimodal mathematical dataset, textbfMultiMath-300K, which spans K-12 levels with image captions and step-wise solutions.
arXiv Detail & Related papers (2024-08-30T07:37:38Z)
MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark [29.9945601202065]
We propose MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information. MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs. We conduct a multi-dimensional evaluation on 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models.
arXiv Detail & Related papers (2024-08-14T13:23:43Z)
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine [85.80851893886161]
We propose MAVIS, a MAthematical VISual instruction tuning pipeline for MLLMs, featuring an automatic data engine to efficiently create mathematical visual datasets. We use MAVIS-Caption to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding. Third, we adopt MAVIS-Instruct to perform the instruction tuning for robust problem-solving skills, and term the resulting model as MAVIS-7B.
arXiv Detail & Related papers (2024-07-11T17:59:47Z)
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5. Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z)
MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [58.57255822646756]
This paper introduces MathChat, a benchmark designed to evaluate large language models (LLMs) across a broader spectrum of mathematical tasks. We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single turn question answering, they significantly underperform in more complex scenarios. We develop MathChat sync, a synthetic dialogue based math dataset for LLM finetuning, focusing on improving models' interaction and instruction following capabilities in conversations.
arXiv Detail & Related papers (2024-05-29T18:45:55Z)
NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations. One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks. We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training [24.989732666940153]
Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs. MLLMs still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. We propose a two-step training pipeline VCAR, which emphasizes the Visual Reasoning training in addition to mathematical learning.
arXiv Detail & Related papers (2024-04-22T21:59:35Z)
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z)
MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs) We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles. Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.