SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
- URL: http://arxiv.org/abs/2405.08807v2
- Date: Thu, 05 Dec 2024 17:52:49 GMT
- Title: SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
- Authors: Jonathan Roberts, Kai Han, Neil Houlsby, Samuel Albanie
- Abstract summary: We present SciFIBench, a scientific figure interpretation benchmark consisting of 2000 questions split between two tasks across 8 categories.
The questions are curated from arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control.
We evaluate 28 LMMs on SciFIBench, finding it to be a challenging benchmark.
- Score: 50.061029816288936
- Abstract: Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we present SciFIBench, a scientific figure interpretation benchmark consisting of 2000 questions split between two tasks across 8 categories. The questions are curated from arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control. We evaluate 28 LMMs on SciFIBench, finding it to be a challenging benchmark. Finally, we investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets from our benchmark. We release SciFIBench to encourage progress in this domain.
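The curation step described in the abstract (multiple-choice questions built from arXiv figure captions, with adversarial filtering to mine hard negatives before human verification) can be illustrated with a minimal sketch. The encoder, similarity measure, and number of distractors below are assumptions for illustration only, not the authors' exact pipeline.

```python
# Hypothetical sketch of hard-negative selection for caption-matching questions:
# for a figure's true caption, pick the most semantically similar captions from
# other papers as multiple-choice distractors. Encoder and k are assumed.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_hard_negatives(true_caption: str, candidate_captions: list[str],
                          k: int = 4) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    emb = model.encode([true_caption] + candidate_captions,
                       convert_to_numpy=True, normalize_embeddings=True)
    query, candidates = emb[0], emb[1:]
    sims = candidates @ query          # cosine similarity (embeddings are normalized)
    top = np.argsort(-sims)[:k]        # most similar captions = hardest negatives
    return [candidate_captions[i] for i in top]

# Example: assemble a 5-way multiple-choice question for one figure.
# options = [true_caption] + select_hard_negatives(true_caption, other_captions)
```

In a real pipeline, the selected distractors would then pass through the human-verification stage mentioned in the abstract before being included in the benchmark.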
Related papers
- An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models [56.537253374781876]
Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks.
However, their spatial reasoning capabilities are under-investigated.
We construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs' spatial understanding and reasoning capabilities.
arXiv Detail & Related papers (2024-11-09T03:07:33Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models [55.903148392998965]
We introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities.
The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks.
We evaluate 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities.
arXiv Detail & Related papers (2024-10-13T05:26:36Z)
- SciDFM: A Large Language Model with Mixture-of-Experts for Science [18.748699390397363]
We introduce SciDFM, a mixture-of-experts LLM that is trained from scratch and is able to conduct college-level scientific reasoning.
We collect a large-scale training corpus containing numerous scientific papers and books from different disciplines as well as data from domain-specific databases.
We show that SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and it reaches a SOTA performance on domain-specific benchmarks among models of similar size.
arXiv Detail & Related papers (2024-09-27T03:00:29Z)
- GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models [36.83397306207386]
We introduce GRAB, a graph analysis benchmark, fit for current and future LMMs.
Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions.
We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%.
arXiv Detail & Related papers (2024-08-21T17:59:32Z)
- SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers [43.18330795060871]
SPIQA is a dataset specifically designed to interpret complex figures and tables within the context of scientific research articles.
We employ automatic and manual curation to create the dataset.
SPIQA comprises 270K questions divided into training, validation, and three different evaluation splits.
arXiv Detail & Related papers (2024-07-12T16:37:59Z)
- Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)
- MFBE: Leveraging Multi-Field Information of FAQs for Efficient Dense Retrieval [1.7403133838762446]
We propose a bi-encoder-based query-FAQ matching model that leverages multiple combinations of FAQ fields.
Our model achieves around 27% and 20% better top-1 accuracy for the FAQ retrieval task on internal and open datasets, respectively (see the sketch after this list).
arXiv Detail & Related papers (2023-02-23T12:02:49Z)
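The MFBE entry above describes a bi-encoder model that matches queries against multiple combinations of FAQ fields. The sketch below shows one plausible reading of that setup; the field combinations, encoder, and max-pooling over field views are assumptions for illustration, not the paper's exact model.

```python
# Rough sketch of bi-encoder FAQ retrieval over multiple FAQ field views.
# Encoder, field combinations, and pooling are assumed, not from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed shared encoder

def embed(texts: list[str]) -> np.ndarray:
    return model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)

def rank_faqs(query: str, faqs: list[dict]) -> list[int]:
    """Rank FAQs by the best-matching field combination for the query."""
    q = embed([query])[0]
    scores = []
    for faq in faqs:
        # Represent each FAQ under several field views: question, answer, both.
        views = [faq["question"], faq["answer"],
                 faq["question"] + " " + faq["answer"]]
        v = embed(views)
        scores.append(float((v @ q).max()))  # max-pool similarity over views
    return list(np.argsort(-np.array(scores)))

# Example usage with a toy FAQ list:
# faqs = [{"question": "How do I reset my password?", "answer": "Use the reset link."}]
# print(rank_faqs("forgot password", faqs))
```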