Related papers: URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

URL: http://arxiv.org/abs/2501.04686v4
Date: Mon, 24 Feb 2025 07:32:58 GMT
Title: URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Authors: Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang,
Abstract summary: Chain-of-Thought (CoT) reasoning is widely used to enhance the mathematical reasoning capabilities of large language models (LLMs)<n>In this work, we propose a novel framework that introduces System 2-style thinking to multimodal mathematical reasoning.
Score: 25.308196207219613
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Chain-of-Thought (CoT) reasoning is widely used to enhance the mathematical reasoning capabilities of large language models (LLMs). The introduction of process supervision for CoT trajectories has sparked discussions on improving test-time scaling, thereby unlocking the System 2-style thinking capabilities of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving both deliberate reasoning and fine-grained verification. In this work, we propose a novel framework that introduces System 2-style thinking to multimodal mathematical reasoning. We introduce a three-module CoT data synthesis process that integrates CoT distillation, trajectory-format rewriting, and format unification. This process generates MMathCoT-1M, a high-quality CoT reasoning instruction fine-tuning dataset. Furthermore, we implement a dual-view trajectory labeling automation that targets both visual grounding fidelity and deductive chain validity, resulting in the DualMath-1.1M dataset. The URSA-8B model, trained on MMathCoT-1M, achieves new state-of-the-art (SOTA) performance among similarly sized multimodal LLMs on six popular reasoning benchmarks. Training URSA-8B further on the DualMath-1.1M dataset yields URSA-RM-8B, a verifier that enhances URSA-8B's test-time performance and surpasses strong closed-source multimodal MLLMs like GPT-4o. The model weights, training data, and code have been open-sourced: https://github.com/URSA-MATH/URSA-MATH.

Related papers

GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning [35.429904556288996]
We introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification. Experimental results show that GenPRM significantly outperforms prior PRMs with only 23K training data from MATH dataset.
arXiv Detail & Related papers (2025-04-01T15:21:05Z)
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action [103.5952731807559]
We present TACO, a family of multi-modal large action models designed to improve performance on complex, multi-step, and multi-modal tasks.<n>During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation and calculator.<n>This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction tuning data with only direct answers.
arXiv Detail & Related papers (2024-12-07T00:42:04Z)
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks.<n>Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales.<n>We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement [71.46993852662021]
We present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. Qwen2.5-Math-Instruct supports both Chinese and English, and possess advanced mathematical reasoning capabilities.
arXiv Detail & Related papers (2024-09-18T16:45:37Z)
SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning [13.728595670907136]
We introduce InfinityMATH, a scalable instruction tuning dataset for programmatic mathematical reasoning. Fine-tuning experiments with open-source language and code models, such as Llama2 and CodeLlama, demonstrate the practical benefits of InfinityMATH.
arXiv Detail & Related papers (2024-08-09T08:18:20Z)
Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On [55.449818944278526]
We introduce the Skywork-Math model series, supervised fine-tuned (SFT) on common 7B language models. Skywork-Math 7B has achieved impressive accuracies of 51.2% on the competition-level MATH benchmark. We provide several practical takeaways to enhance math reasoning abilities in LLMs for both research and industry applications.
arXiv Detail & Related papers (2024-07-11T09:56:51Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Markovian Transformers for Informative Language Modeling [0.9642500063568188]
Chain-of-Thought (CoT) reasoning holds great promise for explaining language model outputs, but recent studies have highlighted significant challenges in its practical application for interpretability.<n>We make CoT causally essential to prediction through two key components: factoring next-token prediction through intermediate CoT text, and training CoT to predict future tokens independently of other context.
arXiv Detail & Related papers (2024-04-29T17:36:58Z)
DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models [21.85879890198875]
Decentralized Iterative Merging-And-Training (DIMAT) is a novel decentralized deep learning algorithm. We show that DIMAT attains faster and higher initial gain in accuracy with independent and identically distributed (IID) and non-IID data, incurring lower communication overhead. This DIMAT paradigm presents a new opportunity for the future decentralized learning, enhancing its adaptability to real-world with sparse lightweight communication computation.
arXiv Detail & Related papers (2024-04-11T18:34:29Z)
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning [110.80663974060624]
Key-Point-Driven Data Synthesis (KPDDS) is a novel data synthesis framework that synthesizes question-answer pairs. KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability. We present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs.
arXiv Detail & Related papers (2024-03-04T18:58:30Z)
How Do Humans Write Code? Large Models Do It the Same Way Too [14.954886191356342]
Program-of-Thought (PoT) replaces natural language-based Chain-of-Thought (CoT) as the most popular method in Large Language Models. Using PoT introduces more reasoning errors, such as incorrect formulas or flawed logic, compared to CoT. We propose Human-Think Language (HTL), which leverages a suite of strategies that help integrate PoT and CoT.
arXiv Detail & Related papers (2024-02-24T05:40:01Z)
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity. We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline [12.186691561822256]
We postulate that the inherent nature of large language models (LLMs) presents challenges in modeling mathematical reasoning. This paper introduces a novel math dataset, enhanced with a capability to utilize a Python code interpreter. We propose a tentative, easily replicable protocol for the fine-tuning of math-specific LLMs.
arXiv Detail & Related papers (2024-01-16T08:08:01Z)
Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2023-02-02T07:51:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.