SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature
- URL: http://arxiv.org/abs/2601.10108v1
- Date: Thu, 15 Jan 2026 06:25:25 GMT
- Title: SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature
- Authors: Yiming Ren, Junjie Wang, Yuxin Meng, Yihang Shi, Zhiqiang Lin, Ruihang Chu, Yiran Xu, Ziming Li, Yunfei Zhao, Zihan Wang, Yu Qiao, Ruiming Tang, Minghao Liu, Yujiu Yang
- Abstract summary: "Fish-in-the-Ocean" (FITO) paradigm requires models to construct explicit cross-modal evidence chains within scientific documents.<n>We construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA) and evidence-anchored synthesis (SIN-Summary)<n>We introduce "No Evidence, No Score", scoring predictions when grounded to verifiable anchors and diagnosing evidence quality via matching, relevance, and logic.
- Score: 92.88058660627678
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce "No Evidence, No Score", which scores predictions only when they are grounded to verifiable anchors and diagnoses evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.
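The "No Evidence, No Score" rule is, in effect, a gating protocol: a prediction earns credit only if its cited evidence resolves to verifiable anchors in the document, and grounded predictions are then diagnosed along matching, relevance, and logic. Below is a minimal sketch of such a gated scorer; the anchor IDs, sub-score inputs, and equal-weight combination are illustrative assumptions, not the paper's exact formulation:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    answer: str
    cited_anchors: list[str]  # evidence anchors the model cites (e.g., figure/section IDs)

def gated_score(pred: Prediction,
                valid_anchors: set[str],
                matching: float,
                relevance: float,
                logic: float,
                answer_correct: bool) -> float:
    """Illustrative "No Evidence, No Score" gate: predictions whose citations
    do not all resolve to verifiable anchors score zero, regardless of answer
    correctness; grounded predictions combine correctness with evidence quality."""
    grounded = bool(pred.cited_anchors) and all(
        a in valid_anchors for a in pred.cited_anchors
    )
    if not grounded:
        return 0.0  # a correct but unsupported answer earns nothing
    # Equal weighting of the three diagnostics is an assumption of this sketch.
    evidence_quality = (matching + relevance + logic) / 3.0
    return (1.0 if answer_correct else 0.0) * evidence_quality

# Hypothetical example: a grounded, correct answer with strong evidence quality.
pred = Prediction(answer="The ablation in Figure 3 drives the gain.",
                  cited_anchors=["fig3", "sec4.2"])
print(gated_score(pred, valid_anchors={"fig3", "sec4.2", "tab1"},
                  matching=0.8, relevance=0.9, logic=0.7,
                  answer_correct=True))  # -> ~0.8
```

Such a gate makes the reported gap concrete: under an evidence-aligned metric of this shape, a model can lead on raw answer accuracy yet trail on the overall score whenever its citations fail to resolve.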
Related papers
- CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse [1.4608214000864057]
CausalT5K is a diagnostic benchmark of over 5,000 cases across 10 domains. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail.
arXiv Detail & Related papers (2026-02-09T17:36:56Z) - RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension [65.81339691942757]
RPC-Bench is a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts.
arXiv Detail & Related papers (2026-01-14T11:37:00Z) - ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction [70.53044880892196]
We introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT); a minimal structural sketch of such a tree follows the list below. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. Evaluations of 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain.
arXiv Detail & Related papers (2025-11-16T07:37:09Z) - SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines [112.78540935201558]
We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions. It supports four capability families covering up to 103 tasks: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, and (v) unconditional and conditional sequence generation and design.
arXiv Detail & Related papers (2025-09-25T17:52:06Z) - Unstructured Evidence Attribution for Long Context Query Focused Summarization [53.08341620504465]
We propose to extract unstructured (i.e., spans of any length) evidence in order to acquire more relevant and consistent evidence than in the fixed-granularity case. We show how existing systems struggle to copy and properly cite unstructured evidence, which also tends to be "lost-in-the-middle".
arXiv Detail & Related papers (2025-02-20T09:57:42Z) - CORRECT: Context- and Reference-Augmented Reasoning and Prompting for Fact-Checking [14.890042094350411]
We propose a novel method, Context- and Reference-augmented Reasoning and Prompting. For evidence reasoning, we construct a three-layer evidence graph with evidence, context, and reference layers. For verdict prediction, we design an evidence-conditioned prompt encoder, which produces unique prompt embeddings for each claim.
arXiv Detail & Related papers (2025-02-09T01:41:15Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework for uniform synthesis of high-quality, diverse theorem and proof data. We present MUSTARDSAUCE, a theorem-and-proof benchmark with 5,866 valid data points, and perform extensive analysis demonstrating that MUSTARD generates validated, high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Retrieve to Explain: Evidence-driven Predictions for Explainable Drug Target Identification [0.791663505497707]
Retrieve to Explain (R2E) is a retrieval-based model that scores and ranks all possible answers to a research question. R2E represents each answer only in terms of its supporting evidence, with the answer itself masked. We developed R2E for the challenging scientific discovery task of drug target identification.
arXiv Detail & Related papers (2024-02-06T15:13:17Z) - THiFLY Research at SemEval-2023 Task 7: A Multi-granularity System for
CTR-based Textual Entailment and Evidence Retrieval [13.30918296659228]
The NLI4CT task requires systems to determine whether hypotheses are entailed by Clinical Trial Reports (CTRs) and to retrieve the corresponding evidence supporting the justification.
We present a multi-granularity system for CTR-based textual entailment and evidence retrieval.
We enhance the numerical inference capability of the system by leveraging a T5-based model, SciFive, which is pre-trained on a medical corpus.
arXiv Detail & Related papers (2023-06-02T03:09:31Z) - CiteWorth: Cite-Worthiness Detection for Improved Scientific Document
Understanding [23.930041685595775]
We present an in-depth study of cite-worthiness detection in English, where a sentence is labelled for whether or not it cites an external source.
CiteWorth is high-quality, challenging, and suitable for studying problems such as domain adaptation.
arXiv Detail & Related papers (2021-05-23T11:08:45Z)