SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
- URL: http://arxiv.org/abs/2506.15569v1
- Date: Wed, 18 Jun 2025 15:43:26 GMT
- Title: SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
- Authors: Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao
- Abstract summary: SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer.
- Score: 29.63899315962693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.
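To make the evaluation setting concrete, the sketch below shows what a minimal claim-verification accuracy loop over SciVer-style examples might look like. This is an illustrative assumption, not the benchmark's official schema or API: the field names (claim, context_text, figure_paths, label), the SUPPORT/REFUTE label set, the JSONL layout, and the query_model() helper are all hypothetical placeholders.

```python
# Minimal sketch of a SciVer-style claim-verification evaluation loop.
# All field names and the query_model() helper are illustrative assumptions.
import json
from typing import List

LABELS = {"SUPPORT", "REFUTE"}  # assumed binary entailment labels


def build_prompt(claim: str, context_text: str) -> str:
    """Compose a verification prompt from the paper context and the claim."""
    return (
        "You are verifying a scientific claim against the paper excerpts "
        "and figures provided.\n\n"
        f"Context:\n{context_text}\n\n"
        f"Claim: {claim}\n\n"
        "Answer with exactly one word: SUPPORT or REFUTE."
    )


def query_model(prompt: str, image_paths: List[str]) -> str:
    """Placeholder for a call to a multimodal foundation model.

    In practice this would wrap an API client that accepts interleaved
    text and images (tables and figures from the paper).
    """
    raise NotImplementedError


def evaluate(examples_path: str) -> float:
    """Compute accuracy over a JSONL file of annotated examples."""
    correct, total = 0, 0
    with open(examples_path) as f:
        for line in f:
            ex = json.loads(line)
            prompt = build_prompt(ex["claim"], ex["context_text"])
            raw = query_model(prompt, ex.get("figure_paths", []))
            pred = raw.strip().upper()
            if pred not in LABELS:
                pred = "REFUTE"  # fallback for unparsable model output
            correct += int(pred == ex["label"])  # label assumed uppercase
            total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    print(f"Accuracy: {evaluate('sciver_examples.jsonl'):.3f}")
```

Under this reading, the abstract's RAG analysis would correspond to replacing the full paper context with retrieved passages before building the prompt, while the multimodal inputs (table and figure images) are passed alongside the text.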
Related papers
- Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers [22.83126850650448]
This paper introduces MISS-QA, the first benchmark designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL.
arXiv Detail & Related papers (2025-07-14T20:35:25Z) - SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks [87.29946641069068]
We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks. We release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data.
arXiv Detail & Related papers (2025-07-01T17:51:59Z) - PhyX: Does Your Model Have the "Wits" for Physical Reasoning? [49.083544963243206]
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning. We introduce PhyX, the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios.
arXiv Detail & Related papers (2025-05-21T18:33:50Z) - Towards Artificial Intelligence Research Assistant for Expert-Involved Learning [64.7438151207189]
Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research. We present ARtificial Intelligence research assistant for Expert-involved Learning (ARIEL).
arXiv Detail & Related papers (2025-05-03T14:21:48Z) - VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic, a benchmark of 1,000 human-verified problems across six categories. These question types can be used to assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context [13.967898012303325]
We introduce LiveIdeaBench, a benchmark evaluating Large Language Models' scientific idea generation. Our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores.
arXiv Detail & Related papers (2024-12-23T14:13:44Z) - Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field [0.0]
This paper offers an analysis of the ability of large language models to identify semantic relationships between different research topics. We developed a gold standard based on the IEEE Thesaurus to evaluate the task. Several models have achieved outstanding results, including Mixtral-8x7B, Dolphin-Mistral, and Claude 3-7B.
arXiv Detail & Related papers (2024-12-11T10:11:41Z) - M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models [27.910693214922052]
We introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark for evaluating foundation models.
M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters.
Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents.
arXiv Detail & Related papers (2024-11-06T17:52:01Z) - VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning [20.56989082014445]
Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks. We present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. The best observed results include 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro.
arXiv Detail & Related papers (2024-09-10T01:20:26Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables [68.76415918462418]
We present SCITAB, a challenging evaluation dataset consisting of 1.2K expert-verified scientific claims.
Through extensive evaluations, we demonstrate that SCITAB poses a significant challenge to state-of-the-art models.
Our analysis uncovers several unique challenges posed by SCITAB, including table grounding, claim ambiguity, and compositional reasoning.
arXiv Detail & Related papers (2023-05-22T16:13:50Z) - SciFact-Open: Towards open-domain scientific claim verification [61.288725621156864]
We present SciFact-Open, a new test collection designed to evaluate the performance of scientific claim verification systems.
We collect evidence for scientific claims by pooling and annotating the top predictions of four state-of-the-art scientific claim verification models.
We find that systems developed on smaller corpora struggle to generalize to SciFact-Open, exhibiting performance drops of at least 15 F1.
arXiv Detail & Related papers (2022-10-25T05:45:00Z)