OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding
- URL: http://arxiv.org/abs/2602.13758v1
- Date: Sat, 14 Feb 2026 13:08:13 GMT
- Title: OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding
- Authors: Haoyi Tao, Chaozheng Huang, Nan Wang, Han Lyu, Linfeng Zhang, Guolin Ke, Xi Fang,
- Abstract summary: We introduce OmniScience, a high-fidelity multi-modal dataset spanning more than 10 major scientific disciplines. We develop a dynamic model-routing re-captioning pipeline that generates dense, self-contained descriptions. The pipeline is reinforced with rigorous quality filtering and alignment with human expert judgments.
- Score: 13.03315906747549
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets with limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets spanning more than 10 major scientific disciplines. To obtain image-caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multi-modal large language models to generate dense, self-contained descriptions by jointly synthesizing visual features, original figure captions, and corresponding in-text references authored by human scientists. The pipeline is further reinforced with rigorous quality filtering and alignment with human expert judgments, ensuring both factual accuracy and semantic completeness, and boosts the image-text multi-modal similarity score from 0.769 to 0.956. We further propose a caption QA protocol as a proxy task for evaluating visual understanding. Under this setting, the Qwen2.5-VL-3B model finetuned on OmniScience shows substantial gains over baselines, achieving gains of 0.378 on MM-MT-Bench and 0.140 on MMMU.
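The abstract describes the re-captioning pipeline only at a high level. As a purely illustrative aid, the sketch below shows what a routing-plus-similarity-filtering step could look like in Python. The router categories, prompt format, CLIP similarity backbone, threshold, and function names (route_model, recaption, caption_fn) are all assumptions introduced for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a model-routing re-captioning + quality-filtering step.
# The routing rules, prompt, similarity model, and threshold are illustrative assumptions.
from typing import Optional

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def image_text_similarity(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings
    (a stand-in for the multi-modal similarity score mentioned in the abstract)."""
    inputs = processor(text=[text], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())


def route_model(figure_type: str) -> str:
    """Toy router: pick a captioning MLLM per figure category (assumed logic)."""
    return {"chart": "mllm-chart-expert",
            "micrograph": "mllm-vision-expert"}.get(figure_type, "mllm-general")


def recaption(triplet: dict, caption_fn, threshold: float = 0.3) -> Optional[dict]:
    """Generate a dense caption from a (figure, caption, context) triplet and keep it
    only if it passes a simple image-text similarity filter."""
    model_name = route_model(triplet.get("figure_type", "other"))
    prompt = (f"Original caption: {triplet['caption']}\n"
              f"In-text references: {triplet['context']}\n"
              "Write a dense, self-contained description of the figure.")
    dense_caption = caption_fn(model_name, triplet["image"], prompt)  # external MLLM call
    score = image_text_similarity(triplet["image"], dense_caption)
    if score < threshold:
        return None  # discard low-alignment captions
    return {**triplet, "dense_caption": dense_caption, "similarity": score}
```

In the actual pipeline, caption_fn would dispatch the prompt to whichever state-of-the-art MLLM the router selects, and the reported similarity improvement (0.769 to 0.956) would be measured with whatever embedding model the authors used, which the abstract does not specify.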
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z) - Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain [10.823938734002288]
This paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters.
arXiv Detail & Related papers (2026-01-14T09:01:46Z) - S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding [16.351123624587384]
S1-MMAlign is a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs. We introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts.
arXiv Detail & Related papers (2026-01-01T08:54:51Z) - Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning [0.03214166687856062]
We introduce a scalable subfigure extraction pipeline based on transformer-based object detection. We release OPEN-PMC-18M, a large-scale, high-quality biomedical vision-language dataset. We show improved performance across retrieval, zero-shot classification, and robustness benchmarks.
arXiv Detail & Related papers (2025-06-03T10:53:19Z) - CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z) - mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger in scale while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)