Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
- URL: http://arxiv.org/abs/2402.14804v1
- Date: Thu, 22 Feb 2024 18:56:38 GMT
- Title: Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
- Authors: Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, Hongsheng Li
- Abstract summary: We present the MATH-Vision dataset, a collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V.
Our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development.
- Score: 33.65525875690291
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promising
results in mathematical reasoning within visual contexts, with models
approaching human-level performance on existing benchmarks such as MathVista.
However, we observe significant limitations in the diversity of questions and
breadth of subjects covered by these benchmarks. To address this issue, we
present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of
3,040 high-quality mathematical problems with visual contexts sourced from real
math competitions. Spanning 16 distinct mathematical disciplines and graded
across 5 levels of difficulty, our dataset provides a comprehensive and diverse
set of challenges for evaluating the mathematical reasoning abilities of LMMs.
Through extensive experimentation, we unveil a notable performance gap between
current LMMs and human performance on MATH-V, underscoring the imperative for
further advancements in LMMs. Moreover, our detailed categorization allows for
a thorough error analysis of LMMs, offering valuable insights to guide future
research and development. The project is available at
https://mathvision-cuhk.github.io
Related papers
- MAVIS: Mathematical Visual Instruction Tuning [64.2868278920047]
We identify three key areas within MLLMs that need to be improved.
visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills.
We propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs.
arXiv Detail & Related papers (2024-07-11T17:59:47Z) - Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On [55.449818944278526]
We introduce the Skywork-Math model series, supervised fine-tuned (SFT) on common 7B language models.
Skywork-Math 7B has achieved impressive accuracies of 51.2% on the competition-level MATH benchmark.
We provide several practical takeaways to enhance math reasoning abilities in LLMs for both research and industry applications.
arXiv Detail & Related papers (2024-07-11T09:56:51Z) - We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? [11.858791083851447]
We introduce WE-MATH, the first benchmark designed to explore the problem-solving principles beyond end-to-end performance.
We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity.
We conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance.
arXiv Detail & Related papers (2024-07-01T13:39:08Z) - Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K.
This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5.
Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z) - Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models [47.129504708849446]
Large Language Models (LLMs) achieve impressive performance in a wide range of tasks.
LLMs show emergent abilities in mathematical reasoning benchmarks.
We evaluate three models of the Llama 2 family on different symbolic reasoning tasks.
arXiv Detail & Related papers (2024-06-05T12:22:43Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification [41.53026834367054]
This paper introduces a novel benchmark, MM-MATH, for evaluating multimodal math reasoning.
MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points.
The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans.
arXiv Detail & Related papers (2024-04-07T22:16:50Z) - Large Language Models for Mathematical Reasoning: Progresses and Challenges [15.925641169201747]
Large Language Models (LLMs) are geared towards the automated resolution of mathematical problems.
This survey endeavors to address four pivotal dimensions.
It provides a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.
arXiv Detail & Related papers (2024-01-31T20:26:32Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.