Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
- URL: http://arxiv.org/abs/2402.14804v1
- Date: Thu, 22 Feb 2024 18:56:38 GMT
- Title: Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
- Authors: Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, Hongsheng Li
- Abstract summary: We present the MATH-Vision dataset, a collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V.
Our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development.
- Score: 33.65525875690291
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promising
results in mathematical reasoning within visual contexts, with models
approaching human-level performance on existing benchmarks such as MathVista.
However, we observe significant limitations in the diversity of questions and
breadth of subjects covered by these benchmarks. To address this issue, we
present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of
3,040 high-quality mathematical problems with visual contexts sourced from real
math competitions. Spanning 16 distinct mathematical disciplines and graded
across 5 levels of difficulty, our dataset provides a comprehensive and diverse
set of challenges for evaluating the mathematical reasoning abilities of LMMs.
Through extensive experimentation, we unveil a notable performance gap between
current LMMs and human performance on MATH-V, underscoring the imperative for
further advancements in LMMs. Moreover, our detailed categorization allows for
a thorough error analysis of LMMs, offering valuable insights to guide future
research and development. The project is available at
https://mathvision-cuhk.github.io
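To make the evaluation setup described above concrete, the following is a minimal sketch of how one might score a model's answers on a MATH-V-style benchmark, breaking accuracy down by subject and difficulty level. The field names (question, image, answer, subject, level), the file name, and the query_model callable are assumptions made for illustration; they are not the dataset's actual schema or the authors' evaluation code.

```python
# Minimal sketch of scoring an LMM on a MATH-V-style benchmark, with accuracy
# broken down by subject and difficulty level. Field names and query_model are
# illustrative assumptions, not the dataset's actual schema.
import json
from collections import defaultdict

def normalize(ans: str) -> str:
    # Crude normalization; a real harness would use a stronger answer matcher.
    return ans.strip().rstrip(".").lower()

def evaluate(problems, query_model):
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    per_level = defaultdict(lambda: [0, 0])    # level   -> [correct, total]
    correct = 0
    for p in problems:
        pred = query_model(p["question"], p["image"])
        hit = int(normalize(pred) == normalize(p["answer"]))
        correct += hit
        per_subject[p["subject"]][0] += hit
        per_subject[p["subject"]][1] += 1
        per_level[p["level"]][0] += hit
        per_level[p["level"]][1] += 1
    return correct / len(problems), dict(per_subject), dict(per_level)

if __name__ == "__main__":
    with open("math_vision.json") as f:  # hypothetical local copy of the dataset
        problems = json.load(f)
    acc, by_subject, by_level = evaluate(problems, lambda q, img: "42")  # dummy model
    print(f"overall accuracy: {acc:.3f}")
```

A real harness would replace the dummy model call with an actual LMM query and use a more robust answer matcher (for example, symbolic comparison for fractions and expressions) before reporting the per-subject and per-level breakdowns.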
Related papers
- VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning [47.81337826098964]
We present VisAidMath, a benchmark for evaluating the visual-aided mathematical problem-solving process.
This benchmark includes 1,200 challenging problems spanning various mathematical branches, visual-aid formulations, and difficulty levels.
We conduct evaluations on ten mainstream LLMs and LMMs, highlighting deficiencies in the visual-aided reasoning process.
arXiv Detail & Related papers (2024-10-30T13:19:44Z)
- Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z)
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain: even the best-performing model, GPT-4o, still trails human performance by around 10%.
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
- MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model [37.26146689342965]
Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning.
However, MLLMs tend to focus predominantly on solving geometric problems while ignoring the diversity of visual information available in other areas of mathematics.
We aim to construct a fine-tuning dataset named MathVL and develop a series of specialized mathematical MLLMs termed MathGLM-Vision.
arXiv Detail & Related papers (2024-09-10T01:20:22Z)
- CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models [35.9843681685377]
We release a Chinese multimodal math (CMM-Math) dataset to evaluate and enhance the mathematical reasoning of LMMs.
CMM-Math contains over 28,000 high-quality samples with detailed solutions across 12 grade levels from elementary to high school in China.
We propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments.
arXiv Detail & Related papers (2024-09-04T16:00:21Z)
- MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark [29.9945601202065]
We propose MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information.
MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs.
We conduct a multi-dimensional evaluation on 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models.
arXiv Detail & Related papers (2024-08-14T13:23:43Z)
- Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K.
This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5.
Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-time search method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models such as Llama-2-13B and Mistral-7B, achieving performance comparable to GPT-3.5 and Grok-1; a minimal illustrative sketch of this style of inference-time search appears after this list.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification [41.53026834367054]
This paper introduces a novel benchmark, MM-MATH, for evaluating multimodal math reasoning.
MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points.
The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans.
arXiv Detail & Related papers (2024-04-07T22:16:50Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
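As referenced in the MindStar entry above, the following is a minimal, illustrative sketch of inference-time search over reasoning paths: candidate next steps are sampled from a generator, scored by a separate path scorer, and the highest-scoring partial path is expanded first (best-first search). The generate_steps, score_path, and is_final callables are assumptions made for illustration; they are not MindStar's actual interfaces, and the paper's two search strategies are described in the work itself.

```python
# Illustrative best-first search over reasoning paths at inference time.
# generate_steps, score_path, and is_final are assumed callables, not MindStar's API.
import heapq
import itertools
from typing import Callable, List

def best_first_reasoning(
    question: str,
    generate_steps: Callable[[str, List[str]], List[str]],  # propose candidate next steps
    score_path: Callable[[str, List[str]], float],          # score a partial reasoning path
    is_final: Callable[[str], bool],                         # does this step end the solution?
    max_expansions: int = 100,
    branch: int = 4,
) -> List[str]:
    """Repeatedly expand the highest-scoring partial path until a complete one is found."""
    counter = itertools.count()              # tie-breaker so the heap never compares paths
    frontier = [(0.0, next(counter), [])]    # (negated score, tie-breaker, path)
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, path = heapq.heappop(frontier)
        if path and is_final(path[-1]):
            return path                       # best complete reasoning path found
        for step in generate_steps(question, path)[:branch]:
            new_path = path + [step]
            heapq.heappush(frontier, (-score_path(question, new_path), next(counter), new_path))
    return []                                 # search budget exhausted without a complete path
```

No model weights are updated in this scheme; any gains come from spending extra compute on search at inference time, with the path scorer (for example, a reward model over reasoning steps) deciding which partial solutions are worth extending.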
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.