S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
- URL: http://arxiv.org/abs/2601.00264v1
- Date: Thu, 01 Jan 2026 08:54:51 GMT
- Title: S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
- Authors: He Wang, Longteng Guo, Pengkang Huo, Xuanxu Lin, Yichen Yuan, Jie Jiang, Jing Liu
- Abstract summary: S1-MMAlign is a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs. We introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts.
- Score: 16.351123624587384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.
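The reported validation pairs SciBERT pseudo-perplexity with CLIP image-text alignment scores. The following is a minimal sketch, not the authors' released code, of how the CLIP-based alignment comparison behind the 18.21% figure could be reproduced for a single figure; the checkpoint name, file path, and both captions are illustrative assumptions.

```python
# Sketch: compare CLIP alignment of a raw caption vs. an enhanced recaption.
# Checkpoint, file path, and captions are hypothetical placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumption: any public CLIP checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clip_alignment(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

figure = Image.open("figure.png")  # hypothetical figure extracted from a paper
raw_score = clip_alignment(figure, "Fig. 3. Results.")
enhanced_score = clip_alignment(
    figure,
    "Heatmap comparing measured and simulated temperature across the sample cross-section.",
)
print(f"relative alignment improvement: {(enhanced_score - raw_score) / raw_score:.2%}")
```

Averaging this relative improvement over the corpus would yield a dataset-level alignment gain comparable in spirit to the figure reported in the abstract; the exact CLIP variant and aggregation used by the authors are not specified here.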
Related papers
- OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding [13.03315906747549]
We introduce OmniScience, a high-fidelity multi-modal dataset spanning more than 10 major scientific disciplines. We develop a dynamic model-routing re-captioning pipeline that generates dense, self-contained descriptions. The pipeline is reinforced with rigorous quality filtering and alignment with human expert judgments.
arXiv Detail & Related papers (2026-02-14T13:08:13Z)
- Enhancing Multimodal Retrieval via Complementary Information Extraction and Alignment [51.96615529872665]
We propose CIEA, a novel multimodal retrieval approach that transforms both text and images in documents into a unified latent space. We optimize CIEA using two complementary contrastive losses to ensure semantic integrity and effectively capture the complementary information contained in images.
arXiv Detail & Related papers (2026-01-08T04:02:49Z)
- Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation [13.362188283113788]
Vision-language pretraining has emerged as a powerful paradigm in medical image analysis. We propose a novel framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining.
arXiv Detail & Related papers (2025-12-03T04:55:54Z)
- SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers [43.18330795060871]
SPIQA is a dataset specifically designed to interpret complex figures and tables within the context of scientific research articles. We employ automatic and manual curation to create the dataset. SPIQA comprises 270K questions divided into training, validation, and three different evaluation splits.
arXiv Detail & Related papers (2024-07-12T16:37:59Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models [51.98253148764755]
We introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, for enhancing LVLMs' scientific comprehension.
ArXivCap is a figure-caption dataset comprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers spanning various scientific domains.
ArXivQA is a question-answering dataset generated by prompting GPT-4V based on scientific figures.
arXiv Detail & Related papers (2024-03-01T02:21:30Z)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
- Multimodal Deep Learning for Scientific Imaging Interpretation [0.0]
This study presents a novel methodology to linguistically emulate and evaluate human-like interactions with Scanning Electron Microscopy (SEM) images.
Our approach distills insights from both textual and visual data harvested from peer-reviewed articles.
Our model (GlassLLaVA) excels in crafting accurate interpretations, identifying key features, and detecting defects in previously unseen SEM images.
arXiv Detail & Related papers (2023-09-21T20:09:22Z)
- PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)