GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
- URL: http://arxiv.org/abs/2408.11817v2
- Date: Thu, 29 Aug 2024 17:47:47 GMT
- Title: GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
- Authors: Jonathan Roberts, Kai Han, Samuel Albanie,
- Abstract summary: We introduce GRAB, a graph analysis benchmark, fit for current and future LMMs.
Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions.
We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%.
- Score: 36.83397306207386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area that LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.
Related papers
- Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models [10.828419851213528]
We propose the Multi-Dimensional Insights benchmark, which includes over 500 images covering six common scenarios of human life.
This design allows for a detailed assessment of LMMs' capabilities in meeting the preferences and needs of different age groups.
Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs.
arXiv Detail & Related papers (2024-12-17T07:06:10Z) - HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks [25.959032350818795]
We present HumanEval-V, a benchmark of human-annotated coding tasks.
Each task features carefully crafted diagrams paired with function signatures and test cases.
We find that even top-performing models achieve modest success rates.
arXiv Detail & Related papers (2024-10-16T09:04:57Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension [53.6373473053431]
This work introduces a benchmark to assess large language models' capabilities in graph pattern tasks.
We have developed a benchmark that evaluates whether LLMs can understand graph patterns based on either terminological or topological descriptions.
Our benchmark encompasses both synthetic and real datasets, and a variety of models, with a total of 11 tasks and 7 models.
arXiv Detail & Related papers (2024-10-04T04:48:33Z) - CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model.
arXiv Detail & Related papers (2024-06-26T17:50:11Z) - GraphFM: A Comprehensive Benchmark for Graph Foundation Model [33.157367455390144]
Foundation Models (FMs) serve as a general class for the development of artificial intelligence systems.
Despite extensive research into self-supervised learning as the cornerstone of FMs, several outstanding issues persist.
The extent of generalization capability on downstream tasks remains unclear.
It is unknown how effectively these models can scale to large datasets.
arXiv Detail & Related papers (2024-06-12T15:10:44Z) - A-Bench: Are LMMs Masters at Evaluating AI-generated Images? [78.3699767628502]
A-Bench is a benchmark designed to diagnose whether multi-modal models (LMMs) are masters at evaluating AI-generated images (AIGIs)
Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs.
arXiv Detail & Related papers (2024-06-05T08:55:02Z) - Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: Visual content is unnecessary for many samples and intentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.