MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark
- URL: http://arxiv.org/abs/2408.07543v3
- Date: Fri, 23 Aug 2024 14:09:34 GMT
- Title: MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark
- Authors: Minxuan Zhou, Hao Liang, Tianpeng Li, Zhiyu Wu, Mingan Lin, Linzhuang Sun, Yaqi Zhou, Yan Zhang, Xiaoqin Huang, Yicong Chen, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou,
- Abstract summary: We propose MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information.
MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs.
We conduct a multi-dimensional evaluation on 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models.
- Score: 29.9945601202065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the development of Multimodal Large Language Models (MLLMs), the evaluation of multimodal models in the context of mathematical problems has become a valuable research field. Multimodal visual-textual mathematical reasoning serves as a critical indicator for evaluating the comprehension and complex multi-step quantitative reasoning abilities of MLLMs. However, previous multimodal math benchmarks have not sufficiently integrated visual and textual information. To address this gap, we proposed MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information. MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs through a categorical hierarchical approach. We conduct a multi-dimensional evaluation on 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models. By analyzing the evaluation results, we identify the limitations of MLLMs, offering valuable insights for enhancing model performance.
Related papers
- EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks.
It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity.
We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z) - Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.
EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.
Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model [25.884425372310147]
Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning.
MLLMs tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics.
We aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision.
arXiv Detail & Related papers (2024-09-10T01:20:22Z) - Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning [5.9767694994869425]
Multimodal Large Language Models (MLLMs) excel in solving text-based mathematical problems.
They struggle with mathematical diagrams since they are primarily trained on natural scene images.
We propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment.
arXiv Detail & Related papers (2024-08-16T10:11:05Z) - MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z) - MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs.
We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.
This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z) - Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset [33.65525875690291]
We present the MATH-Vision dataset, a collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V.
Our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development.
arXiv Detail & Related papers (2024-02-22T18:56:38Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal
Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.