MathVista: Evaluating Mathematical Reasoning of Foundation Models in
Visual Contexts
- URL: http://arxiv.org/abs/2310.02255v3
- Date: Sun, 21 Jan 2024 03:47:06 GMT
- Title: MathVista: Evaluating Mathematical Reasoning of Foundation Models in
Visual Contexts
- Authors: Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh
Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao
- Abstract summary: MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
- Score: 170.01089233942594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive problem-solving skills in many tasks and domains, but their ability
in mathematical reasoning in visual contexts has not been systematically
studied. To bridge this gap, we present MathVista, a benchmark designed to
combine challenges from diverse mathematical and visual tasks. It consists of
6,141 examples, derived from 28 existing multimodal datasets involving
mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and
PaperQA). Completing these tasks requires fine-grained, deep visual
understanding and compositional reasoning, which all state-of-the-art
foundation models find challenging. With MathVista, we have conducted a
comprehensive, quantitative evaluation of 12 prominent foundation models. The
best-performing GPT-4V model achieves an overall accuracy of 49.9%,
substantially outperforming Bard, the second-best performer, by 15.1%. Our
in-depth analysis reveals that the superiority of GPT-4V is mainly attributed
to its enhanced visual perception and mathematical reasoning. However, GPT-4V
still falls short of human performance by 10.4%, as it often struggles to
understand complex figures and perform rigorous reasoning. This significant gap
underscores the critical role that MathVista will play in the development of
general-purpose AI agents capable of tackling mathematically intensive and
visually rich real-world tasks. We further explore the new ability of
self-verification, the application of self-consistency, and the interactive
chatbot capabilities of GPT-4V, highlighting its promising potential for future
research. The project is available at https://mathvista.github.io/.
Related papers
- DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models [19.787224412654872]
We introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of Vision-Language Models (VLMs)
DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program.
Our results show that the worst-case model accuracy, defined as the percentage of correctly answered seed questions in all 10 variants, is significantly lower than the average-case accuracy.
arXiv Detail & Related papers (2024-10-29T17:29:19Z) - Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z) - CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model.
arXiv Detail & Related papers (2024-06-26T17:50:11Z) - Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K.
This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5.
Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z) - Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data.
We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more.
We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z) - Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 datasets benchmark by collecting task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z) - NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning
Tasks [37.730939229638224]
We propose NumGLUE, a benchmark that evaluates the performance of AI systems on eight different tasks.
We show that this benchmark is far from being solved with neural models including state-of-the-art large-scale language models.
We hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language.
arXiv Detail & Related papers (2022-04-12T09:36:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.