Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
- URL: http://arxiv.org/abs/2402.14804v1
- Date: Thu, 22 Feb 2024 18:56:38 GMT
- Title: Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
- Authors: Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, Hongsheng Li
- Abstract summary: We present the MATH-Vision dataset, a collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V.
Our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development.
- Score: 33.65525875690291
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promising
results in mathematical reasoning within visual contexts, with models
approaching human-level performance on existing benchmarks such as MathVista.
However, we observe significant limitations in the diversity of questions and
breadth of subjects covered by these benchmarks. To address this issue, we
present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of
3,040 high-quality mathematical problems with visual contexts sourced from real
math competitions. Spanning 16 distinct mathematical disciplines and graded
across 5 levels of difficulty, our dataset provides a comprehensive and diverse
set of challenges for evaluating the mathematical reasoning abilities of LMMs.
Through extensive experimentation, we unveil a notable performance gap between
current LMMs and human performance on MATH-V, underscoring the imperative for
further advancements in LMMs. Moreover, our detailed categorization allows for
a thorough error analysis of LMMs, offering valuable insights to guide future
research and development. The project is available at
https://mathvision-cuhk.github.io
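To make the evaluation setup described above concrete, the following is a minimal sketch of how one might score a model's answers on a MATH-V-style benchmark, breaking accuracy down by subject and difficulty level. The field names (question, image, answer, subject, level), the file name, and the query_model callable are assumptions made for illustration; they are not the dataset's actual schema or the authors' evaluation code.

```python
# Minimal sketch of scoring an LMM on a MATH-V-style benchmark, with accuracy
# broken down by subject and difficulty level. Field names and query_model are
# illustrative assumptions, not the dataset's actual schema.
import json
from collections import defaultdict

def normalize(ans: str) -> str:
    # Crude normalization; a real harness would use a stronger answer matcher.
    return ans.strip().rstrip(".").lower()

def evaluate(problems, query_model):
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    per_level = defaultdict(lambda: [0, 0])    # level   -> [correct, total]
    correct = 0
    for p in problems:
        pred = query_model(p["question"], p["image"])
        hit = int(normalize(pred) == normalize(p["answer"]))
        correct += hit
        per_subject[p["subject"]][0] += hit
        per_subject[p["subject"]][1] += 1
        per_level[p["level"]][0] += hit
        per_level[p["level"]][1] += 1
    return correct / len(problems), dict(per_subject), dict(per_level)

if __name__ == "__main__":
    with open("math_vision.json") as f:  # hypothetical local copy of the dataset
        problems = json.load(f)
    acc, by_subject, by_level = evaluate(problems, lambda q, img: "42")  # dummy model
    print(f"overall accuracy: {acc:.3f}")
```

A real harness would replace the dummy model call with an actual LMM query and use a more robust answer matcher (for example, symbolic comparison for fractions and expressions) before reporting the per-subject and per-level breakdowns.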
Related papers
- VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning [47.81337826098964]
We present VisAidMath, a benchmark for evaluating the visual-aided mathematical problem-solving process.
This benchmark includes 1,200 challenging problems spanning various mathematical branches, visual-aid formulations, and difficulty levels.
We conduct evaluations on ten mainstream LLMs and LMMs, highlighting deficiencies in the visual-aided reasoning process.
arXiv Detail & Related papers (2024-10-30T13:19:44Z)
- Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z)
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain: even the best-performing model, GPT-4o, still trails human performance by around 10%.
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
- MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model [37.26146689342965]
Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning.
However, MLLMs tend to focus predominantly on solving geometric problems while ignoring the diversity of visual information available in other areas of mathematics.
We aim to construct a fine-tuning dataset named MathVL and develop a series of specialized mathematical MLLMs termed MathGLM-Vision.
arXiv Detail & Related papers (2024-09-10T01:20:22Z)
- CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models [35.9843681685377]
We release a Chinese multimodal math (CMM-Math) dataset to evaluate and enhance the mathematical reasoning of LMMs.
CMM-Math contains over 28,000 high-quality samples with detailed solutions across 12 grade levels from elementary to high school in China.
We propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments.
arXiv Detail & Related papers (2024-09-04T16:00:21Z)
- MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark [29.9945601202065]
We propose MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information.
MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs.
We conduct a multi-dimensional evaluation on 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models.
arXiv Detail & Related papers (2024-08-14T13:23:43Z)
- Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K.
This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5.
Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-time search method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models such as Llama-2-13B and Mistral-7B, achieving performance comparable to GPT-3.5 and Grok-1; a minimal illustrative sketch of this style of inference-time search appears after this list.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification [41.53026834367054]
This paper introduces a novel benchmark, MM-MATH, for evaluating multimodal math reasoning.
MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points.
The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans.
arXiv Detail & Related papers (2024-04-07T22:16:50Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
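As referenced in the MindStar entry above, the following is a minimal, illustrative sketch of inference-time search over reasoning paths: candidate next steps are sampled from a generator, scored by a separate path scorer, and the highest-scoring partial path is expanded first (best-first search). The generate_steps, score_path, and is_final callables are assumptions made for illustration; they are not MindStar's actual interfaces, and the paper's two search strategies are described in the work itself.

```python
# Illustrative best-first search over reasoning paths at inference time.
# generate_steps, score_path, and is_final are assumed callables, not MindStar's API.
import heapq
import itertools
from typing import Callable, List

def best_first_reasoning(
    question: str,
    generate_steps: Callable[[str, List[str]], List[str]],  # propose candidate next steps
    score_path: Callable[[str, List[str]], float],          # score a partial reasoning path
    is_final: Callable[[str], bool],                         # does this step end the solution?
    max_expansions: int = 100,
    branch: int = 4,
) -> List[str]:
    """Repeatedly expand the highest-scoring partial path until a complete one is found."""
    counter = itertools.count()              # tie-breaker so the heap never compares paths
    frontier = [(0.0, next(counter), [])]    # (negated score, tie-breaker, path)
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, path = heapq.heappop(frontier)
        if path and is_final(path[-1]):
            return path                       # best complete reasoning path found
        for step in generate_steps(question, path)[:branch]:
            new_path = path + [step]
            heapq.heappush(frontier, (-score_path(question, new_path), next(counter), new_path))
    return []                                 # search budget exhausted without a complete path
```

No model weights are updated in this scheme; any gains come from spending extra compute on search at inference time, with the path scorer (for example, a reward model over reasoning steps) deciding which partial solutions are worth extending.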
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.