Related papers: MuSciClaims: Multimodal Scientific Claim Verification

MuSciClaims: Multimodal Scientific Claim Verification

URL: http://arxiv.org/abs/2506.04585v2
Date: Wed, 30 Jul 2025 03:05:30 GMT
Title: MuSciClaims: Multimodal Scientific Claim Verification
Authors: Yash Kumar Lal, Manikanta Bandham, Mohammad Saqib Hasan, Apoorva Kashi, Mahnaz Koupaee, Niranjan Balasubramanian,
Abstract summary: We introduce a new benchmark MuSciClaims accompanied by diagnostics tasks.<n>We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims.<n>Our results show most vision-language models are poor (0.3-0.5 F1), with even the best model only achieving 0.72 F1.
Score: 13.598508835610474
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark MuSciClaims accompanied by diagnostics tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.72 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.

Related papers

Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts [19.571644726057666]
We design and conduct experiments to assess the ability of multimodal large language models to verify scientific claims using both tables and charts as evidence.<n>Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence.<n>Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization.
arXiv Detail & Related papers (2025-11-13T08:29:47Z)
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts [49.99400612296149]
We find that models can ace many benchmarks without strong visual understanding.<n>This is especially problematic for vision-centric benchmarks that are meant to require visual inputs.<n>We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be gamed.
arXiv Detail & Related papers (2025-11-06T18:43:21Z)
When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning [22.39245479538899]
We introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result.<n>A model-agnostic evaluation layer treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing.<n>A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead)
arXiv Detail & Related papers (2025-11-04T18:20:13Z)
PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies [16.537126902822127]
We introduce PRISMM-Bench, the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers.<n>We design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies.<n>We benchmark 21 leading LMMs, including large openweight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning)
arXiv Detail & Related papers (2025-10-18T13:46:26Z)
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts [4.809421212365958]
We introduce MedFact, a new benchmark for Chinese medical fact-checking.<n>It comprises 2,116 expert-annotated instances curated from diverse real-world texts.<n>It employs a hybrid AI-human framework where expert feedback refines an AI-driven, multi-criteria filtering process.
arXiv Detail & Related papers (2025-09-15T20:46:21Z)
Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling [0.0]
Current approaches to visual question answering often struggle with the precision required for scientific data interpretation.<n>We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles.<n>Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning and ensemble modeling in improving the model's ability in visual question answering.
arXiv Detail & Related papers (2025-07-08T17:05:42Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications.<n>Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth.<n>Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage.<n>Models may behave unreliably due to poorly explored failure modes.<n> causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
Enhancing Lie Detection Accuracy: A Comparative Study of Classic ML, CNN, and GCN Models using Audio-Visual Features [0.0]
Inaccuracies in polygraph tests often lead to wrongful convictions, false information, and bias. analyzing facial micro-expressions has emerged as a method for detecting deception. The CNN Conv1D multimodal model achieved an average accuracy of 95.4%.
arXiv Detail & Related papers (2024-10-26T22:17:36Z)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.<n>We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation.<n>Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z)
Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.<n>We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.<n>By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables [68.76415918462418]
We present SCITAB, a challenging evaluation dataset consisting of 1.2K expert-verified scientific claims. Through extensive evaluations, we demonstrate that SCITAB poses a significant challenge to state-of-the-art models. Our analysis uncovers several unique challenges posed by SCITAB, including table grounding, claim ambiguity, and compositional reasoning.
arXiv Detail & Related papers (2023-05-22T16:13:50Z)
Logically at the Factify 2022: Multimodal Fact Verification [2.8914815569249823]
This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022. Two baseline approaches are proposed and explored including an ensemble model and a multi-modal attention network. Our best model is ranked first in leaderboard which obtains a weighted average F-measure of 0.77 on both validation and test set.
arXiv Detail & Related papers (2021-12-16T23:34:07Z)
Training Verifiers to Solve Math Word Problems [12.307284507186342]
We introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance. To increase performance, we propose training verifiers to judge the correctness of model completions.
arXiv Detail & Related papers (2021-10-27T04:49:45Z)
A Multi-Level Attention Model for Evidence-Based Fact Checking [58.95413968110558]
We present a simple model that can be trained on sequence structures. Results on a large-scale dataset for Fact Extraction and VERification show that our model outperforms the graph-based approaches.
arXiv Detail & Related papers (2021-06-02T05:40:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.