Related papers: MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

URL: http://arxiv.org/abs/2409.00147v1
Date: Fri, 30 Aug 2024 07:37:38 GMT
Title: MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models
Authors: Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, Zhi Tang
Abstract summary: We introduce MultiMath-7B, a large language model that bridges the gap between math and vision. MultiMath-7B is trained through a four-stage process, focusing on vision-language alignment, visual and math instruction-tuning, and process-supervised reinforcement learning. We also construct a novel, diverse and comprehensive multimodal mathematical dataset, MultiMath-300K, which spans K-12 levels with image captions and step-wise solutions.
Score: 14.274813480249161
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid development of large language models (LLMs) has spurred extensive research into their domain-specific capabilities, particularly mathematical reasoning. However, most open-source LLMs focus solely on mathematical reasoning, neglecting the integration with visual injection, despite the fact that many mathematical tasks rely on visual inputs such as geometric diagrams, charts, and function plots. To fill this gap, we introduce MultiMath-7B, a multimodal large language model that bridges the gap between math and vision. MultiMath-7B is trained through a four-stage process, focusing on vision-language alignment, visual and math instruction-tuning, and process-supervised reinforcement learning. We also construct a novel, diverse and comprehensive multimodal mathematical dataset, MultiMath-300K, which spans K-12 levels with image captions and step-wise solutions. MultiMath-7B achieves state-of-the-art (SOTA) performance among open-source models on existing multimodal mathematical benchmarks and also excels on text-only mathematical benchmarks. Our model and dataset are available at https://github.com/pengshuai-rin/MultiMath.

Related papers

MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning [58.776297011268845]
We present a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic Visual Chain-of-Thought (VCoT) capabilities for mathematics. Our model, BAGEL-canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs.
arXiv Detail & Related papers (2025-10-16T17:58:58Z)
MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning [6.8892368960722346]
We introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models.
arXiv Detail & Related papers (2025-10-16T04:59:52Z)
CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images [69.93976232543066]
We propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning. Our model achieves up to a 21% increase over the base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm.
arXiv Detail & Related papers (2025-10-13T17:59:55Z)
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts [34.972503583614674]
We introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios, and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels. We observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH.
arXiv Detail & Related papers (2025-02-28T07:50:36Z)
MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task [49.355810887265925]
We introduce MathFimer, a novel framework for mathematical reasoning step expansion. We develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset. We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains.
arXiv Detail & Related papers (2025-02-17T11:22:24Z)
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance. We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z)
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model. Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-09-19T08:41:21Z)
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model [37.26146689342965]
Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning. MLLMs tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. We aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision.
arXiv Detail & Related papers (2024-09-10T01:20:22Z)
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning [5.9767694994869425]
Multimodal Large Language Models (MLLMs) excel in solving text-based mathematical problems. They struggle with mathematical diagrams since they are primarily trained on natural scene images. We propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment.
arXiv Detail & Related papers (2024-08-16T10:11:05Z)
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5. Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z)
Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training [24.989732666940153]
Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs. MLLMs still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. We propose a two-step training pipeline VCAR, which emphasizes the Visual Reasoning training in addition to mathematical learning.
arXiv Detail & Related papers (2024-04-22T21:59:35Z)
InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning [98.53491178426492]
We open-source our math reasoning LLMs, InternLM-Math, which are continually pre-trained from InternLM2. We unify chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and code interpreter in a unified seq2seq format. Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning.
arXiv Detail & Related papers (2024-02-09T11:22:08Z)
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models [91.66694225955872]
We propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting each question from multiple perspectives without extra knowledge. We release the full MetaMathQA dataset, the MetaMath models in different sizes, and the training code for public use.
arXiv Detail & Related papers (2023-09-21T17:45:42Z)
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct [128.89645483139236]
We present WizardMath, which enhances the mathematical reasoning abilities of Llama-2 by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Our model even surpasses ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, and simultaneously surpasses Text-davinci, PaLM-1 and GPT-3 on MATH.
arXiv Detail & Related papers (2023-08-18T14:23:21Z)
MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs). We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles. Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.