MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2506.05331v1
- Date: Thu, 05 Jun 2025 17:59:02 GMT
- Title: MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
- Authors: Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li
- Abstract summary: Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs). We propose Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning.
- Score: 43.525708427464544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either apply similar textual reasoning directly to image inputs, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems that align each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which yields our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT
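The Interleave Token described in the abstract can be pictured as a token-level selection step: at a given reasoning step, every visual patch token is scored against the current reasoning state, and the tokens above a threshold are spliced into the chain, so the selected region can take any shape rather than a bounding box. A minimal sketch of that idea (the function name, cosine scoring, and 0.5 threshold are illustrative assumptions, not the paper's actual mechanism):

```python
import numpy as np

def select_visual_tokens(query, visual_tokens, threshold=0.5):
    """Return indices of visual tokens whose cosine similarity to the
    current reasoning state exceeds the threshold. The indices need not
    be contiguous, so the selected region can have any shape."""
    q = query / np.linalg.norm(query)
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    scores = v @ q                      # one similarity score per visual token
    return np.flatnonzero(scores > threshold)

# toy figure with four patch tokens; the query attends to the first axis
patches = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 0.0, 1.0]])
state = np.array([1.0, 0.0, 0.0])
print(select_visual_tokens(state, patches))  # -> [0 2]
```

The selected tokens (here, the first and third patches) would then be inserted into the text sequence before decoding continues, rather than cropped out as a rectangular sub-image.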
Related papers
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning [105.25503508433758]
We introduce Zebra-CoT, a diverse large-scale dataset with 182,384 samples.
We focus on four categories of tasks where sketching or visual reasoning is especially natural.
Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains.
arXiv Detail & Related papers (2025-07-22T16:35:36Z)
- MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning [36.55610944179401]
We propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures.
Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach.
We present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving.
arXiv Detail & Related papers (2025-05-15T17:59:21Z)
- Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization [69.29207684569695]
Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs).
Existing approaches are focused on text CoT, limiting their ability to leverage visual cues.
In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization.
arXiv Detail & Related papers (2025-04-25T14:48:18Z)
- Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [60.04718679054704]
Chain-of-Thought prompting elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs.
We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints.
SoT achieves token reductions of up to 78% with minimal accuracy loss across 15 reasoning datasets.
arXiv Detail & Related papers (2025-03-07T06:57:17Z)
- MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models [14.274813480249161]
We introduce MultiMath-7B, a large language model that bridges the gap between math and vision.
MultiMath-7B is trained through a four-stage process, focusing on vision-language alignment, visual and math instruction-tuning, and process-supervised reinforcement learning.
We also construct a novel, diverse and comprehensive multimodal mathematical dataset, MultiMath-300K, which spans K-12 levels with image captions and step-wise solutions.
arXiv Detail & Related papers (2024-08-30T07:37:38Z)
- MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine [85.80851893886161]
We propose MAVIS, a MAthematical VISual instruction tuning pipeline for MLLMs, featuring an automatic data engine to efficiently create mathematical visual datasets.
We use MAVIS-Caption to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding.
Third, we adopt MAVIS-Instruct to perform the instruction tuning for robust problem-solving skills, and term the resulting model as MAVIS-7B.
arXiv Detail & Related papers (2024-07-11T17:59:47Z)
- Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning [31.110005898556892]
Large Language Models (LLMs) have shown impressive capabilities, yet they still struggle with math reasoning.
We propose CoT-Influx, a novel approach that pushes the boundary of few-shot Chain-of-Thoughts (CoT) learning.
CoT-Influx employs a coarse-to-fine pruner to maximize the input of effective and concise CoT examples.
arXiv Detail & Related papers (2023-12-14T13:03:13Z)
- Learnable Graph Matching: A Practical Paradigm for Data Association [74.28753343714858]
We propose a general learnable graph matching method to address these issues.
Our method achieves state-of-the-art performance on several MOT datasets.
For image matching, our method outperforms state-of-the-art methods on a popular indoor dataset, ScanNet.
arXiv Detail & Related papers (2023-03-27T17:39:00Z)
- Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks [108.4568236569645]
Chain-of-thought prompting (CoT) is by far the state-of-the-art method for these tasks.
We propose Program of Thoughts (PoT), which uses language models to express the reasoning process as a program.
PoT shows an average performance gain of around 12% over CoT across all evaluated datasets.
arXiv Detail & Related papers (2022-11-22T21:06:00Z)
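The Program of Thoughts entry above delegates computation to an interpreter: the model emits a short program whose final variable holds the answer, and Python performs the arithmetic instead of the language model. A minimal illustration (the compound-interest question and the generated snippet are invented for this sketch, not taken from the paper):

```python
# Instead of writing out the arithmetic in natural language, a PoT-style
# model would emit a program like this for the question
# "What is the balance on $10,000 at 5% annual interest after 3 years?"
generated_program = """
principal = 10000
rate = 0.05
years = 3
ans = principal * (1 + rate) ** years
"""

# the reasoning chain is executed rather than read
scope = {}
exec(generated_program, scope)
print(round(scope["ans"], 2))  # -> 11576.25
```

Delegating the calculation to the interpreter is what removes the arithmetic-error failure mode of plain CoT on numerical tasks.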
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.