AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning
- URL: http://arxiv.org/abs/2411.11930v3
- Date: Fri, 13 Dec 2024 06:54:04 GMT
- Title: AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning
- Authors: Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Runhui Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang,
- Abstract summary: AtomThink is a framework for constructing long chains of thought (CoT) in a step-by-step manner, guiding MLLMs to perform complex reasoning.<n>AtomMATH is a large-scale multimodal dataset of long CoTs, and an atomic capability evaluation metric for mathematical tasks.<n>AtomThink significantly improves the performance of baseline MLLMs, achieving approximately 50% relative accuracy gains on MathVista and 120% on MathVerse.
- Score: 70.95645743670062
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of ``slow thinking" into multimodal large language models (MLLMs). Contrary to existing methods that rely on direct or fast thinking, our key idea is to construct long chains of thought (CoT) consisting of atomic actions in a step-by-step manner, guiding MLLMs to perform complex reasoning. To this end, we design a novel AtomThink framework composed of three key modules: (i) a CoT annotation engine that automatically generates high-quality CoT annotations to address the lack of high-quality visual mathematical data; (ii) an atomic step fine-tuning strategy that jointly optimizes an MLLM and a policy reward model (PRM) for step-wise reasoning; and (iii) four different search strategies that can be applied with the PRM to complete reasoning. Additionally, we propose AtomMATH, a large-scale multimodal dataset of long CoTs, and an atomic capability evaluation metric for mathematical tasks. Extensive experimental results show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving approximately 50\% relative accuracy gains on MathVista and 120\% on MathVerse. To support the advancement of multimodal slow-thinking models, we will make our code and dataset publicly available on https://github.com/Quinn777/AtomThink.
Related papers
- Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models? [68.72260770171212]
We propose a paradigm of Self-structured Chain of Thought (SCoT), which is composed of minimal semantic atomic steps.
Our method can not only generate cognitive CoT structures for various complex tasks but also mitigates the phenomenon of overthinking.
We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs.
arXiv Detail & Related papers (2025-03-08T15:23:47Z) - Decompose-ToM: Enhancing Theory of Mind Reasoning in Large Language Models through Simulation and Task Decomposition [2.089191490381739]
Theory of Mind (ToM) is the ability to understand and reflect on the mental states of others.
Large Language Models (LLMs) possess only a rudimentary understanding of ToM.
We propose Decompose-ToM'': an LLM-based inference algorithm that improves model performance on complex ToM tasks.
arXiv Detail & Related papers (2025-01-15T18:44:01Z) - Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.
EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.
Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z) - TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action [103.5952731807559]
We present TACO, a family of multi-modal large action models designed to improve performance on complex, multi-step, and multi-modal tasks.
During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation and calculator.
This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction tuning data with only direct answers.
arXiv Detail & Related papers (2024-12-07T00:42:04Z) - MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks.
Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales.
We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z) - MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for Strengthening LLM [15.687878949848182]
Large language models (LLMs) have shown limitations in tasks requiring complex logical reasoning and multi-step problem-solving.
We introduce MTMT (Multi-thinking Modes Tree), a novel method that interacts with LLMs to construct a thought tree.
We evaluate the performance of MTMT under different parameter configurations, using GPT-4o mini as the base model.
arXiv Detail & Related papers (2024-12-05T09:05:30Z) - Supervised Chain of Thought [5.389461633686935]
Chain of Thought (CoT) prompting offers a promising approach to solving complex reasoning tasks.
One-prompt-for-all approach poses significant challenges for models to generate the correct reasoning steps.
We show how task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance.
arXiv Detail & Related papers (2024-10-18T06:25:27Z) - A Notion of Complexity for Theory of Mind via Discrete World Models [2.487142846438629]
Theory of Mind (ToM) can be used to assess the capabilities of Large Language Models (LLMs) in complex scenarios where social reasoning is required.
This work proposes a framework inspired by cognitive load theory to measure the complexity of ToM tasks.
arXiv Detail & Related papers (2024-06-16T16:46:55Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - MMToM-QA: Multimodal Theory of Mind Question Answering [80.87550820953236]
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z) - Improving Planning with Large Language Models: A Modular Agentic Architecture [7.63815864256878]
Large language models (LLMs) often struggle with tasks that require multi-step reasoning or goal-directed planning.
We propose an agentic architecture, the Modular Agentic Planner (MAP), in which planning is accomplished via the recurrent interaction of specialized modules.
We find that MAP yields significant improvements over both standard LLM methods.
arXiv Detail & Related papers (2023-09-30T00:10:14Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z) - Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2023-02-02T07:51:19Z) - Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and
Personalized Federated Learning [56.17603785248675]
Model-agnostic meta-learning (MAML) has become a popular research area.
Existing MAML algorithms rely on the episode' idea by sampling a few tasks and data points to update the meta-model at each iteration.
This paper proposes memory-based algorithms for MAML that converge with vanishing error.
arXiv Detail & Related papers (2021-06-09T08:47:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.