SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
- URL: http://arxiv.org/abs/2405.08807v2
- Date: Thu, 05 Dec 2024 17:52:49 GMT
- Title: SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
- Authors: Jonathan Roberts, Kai Han, Neil Houlsby, Samuel Albanie
- Abstract summary: We present SciFIBench, a scientific figure interpretation benchmark consisting of 2000 questions split between two tasks across 8 categories.
The questions are curated from arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control.
We evaluate 28 LMMs on SciFIBench, finding it to be a challenging benchmark.
- Score: 50.061029816288936
- Abstract: Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we present SciFIBench, a scientific figure interpretation benchmark consisting of 2000 questions split between two tasks across 8 categories. The questions are curated from arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control. We evaluate 28 LMMs on SciFIBench, finding it to be a challenging benchmark. Finally, we investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets from our benchmark. We release SciFIBench to encourage progress in this domain.
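The curation step described in the abstract (multiple-choice questions built from arXiv figure captions, with adversarial filtering to mine hard negatives before human verification) can be illustrated with a minimal sketch. The encoder, similarity measure, and number of distractors below are assumptions for illustration only, not the authors' exact pipeline.

```python
# Hypothetical sketch of hard-negative selection for caption-matching questions:
# for a figure's true caption, pick the most semantically similar captions from
# other papers as multiple-choice distractors. Encoder and k are assumed.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_hard_negatives(true_caption: str, candidate_captions: list[str],
                          k: int = 4) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    emb = model.encode([true_caption] + candidate_captions,
                       convert_to_numpy=True, normalize_embeddings=True)
    query, candidates = emb[0], emb[1:]
    sims = candidates @ query          # cosine similarity (embeddings are normalized)
    top = np.argsort(-sims)[:k]        # most similar captions = hardest negatives
    return [candidate_captions[i] for i in top]

# Example: assemble a 5-way multiple-choice question for one figure.
# options = [true_caption] + select_hard_negatives(true_caption, other_captions)
```

In a real pipeline, the selected distractors would then pass through the human-verification stage mentioned in the abstract before being included in the benchmark.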
Related papers
- An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models [56.537253374781876]
Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks.
However, their spatial reasoning capabilities are under-investigated.
We construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs' spatial understanding and reasoning capabilities.
arXiv Detail & Related papers (2024-11-09T03:07:33Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models [55.903148392998965]
We introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities.
The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks.
We evaluate 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities.
arXiv Detail & Related papers (2024-10-13T05:26:36Z)
- SciDFM: A Large Language Model with Mixture-of-Experts for Science [18.748699390397363]
We introduce SciDFM, a mixture-of-experts LLM that is trained from scratch and is able to conduct college-level scientific reasoning.
We collect a large-scale training corpus containing numerous scientific papers and books from different disciplines as well as data from domain-specific databases.
We show that SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and it reaches a SOTA performance on domain-specific benchmarks among models of similar size.
arXiv Detail & Related papers (2024-09-27T03:00:29Z)
- GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models [36.83397306207386]
We introduce GRAB, a graph analysis benchmark, fit for current and future LMMs.
Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions.
We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%.
arXiv Detail & Related papers (2024-08-21T17:59:32Z)
- SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers [43.18330795060871]
SPIQA is a dataset specifically designed to interpret complex figures and tables within the context of scientific research articles.
We employ automatic and manual curation to create the dataset.
SPIQA comprises 270K questions divided into training, validation, and three different evaluation splits.
arXiv Detail & Related papers (2024-07-12T16:37:59Z)
- Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)
- MFBE: Leveraging Multi-Field Information of FAQs for Efficient Dense Retrieval [1.7403133838762446]
We propose a bi-encoder-based query-FAQ matching model that leverages multiple combinations of FAQ fields.
Our model achieves around 27% and 20% better top-1 accuracy for the FAQ retrieval task on internal and open datasets, respectively (see the sketch after this list).
arXiv Detail & Related papers (2023-02-23T12:02:49Z)
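The MFBE entry above describes a bi-encoder model that matches queries against multiple combinations of FAQ fields. The sketch below shows one plausible reading of that setup; the field combinations, encoder, and max-pooling over field views are assumptions for illustration, not the paper's exact model.

```python
# Rough sketch of bi-encoder FAQ retrieval over multiple FAQ field views.
# Encoder, field combinations, and pooling are assumed, not from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed shared encoder

def embed(texts: list[str]) -> np.ndarray:
    return model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)

def rank_faqs(query: str, faqs: list[dict]) -> list[int]:
    """Rank FAQs by the best-matching field combination for the query."""
    q = embed([query])[0]
    scores = []
    for faq in faqs:
        # Represent each FAQ under several field views: question, answer, both.
        views = [faq["question"], faq["answer"],
                 faq["question"] + " " + faq["answer"]]
        v = embed(views)
        scores.append(float((v @ q).max()))  # max-pool similarity over views
    return list(np.argsort(-np.array(scores)))

# Example usage with a toy FAQ list:
# faqs = [{"question": "How do I reset my password?", "answer": "Use the reset link."}]
# print(rank_faqs("forgot password", faqs))
```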