From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature
- URL: http://arxiv.org/abs/2512.02566v1
- Date: Tue, 02 Dec 2025 09:37:51 GMT
- Title: From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature
- Authors: Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He, Shi Li, Nassir Navab, Xiaoxiao Sun, Nicolas Padoy, Serena Yeung-Levy
- Abstract summary: We introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels. We develop a granularity-aware pretraining strategy that unifies heterogeneous objectives, from coarse didactic descriptions to fine region-focused phrases.
- Score: 86.7745150269054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is growing interest in developing strong biomedical vision-language models. A popular approach to achieving robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives, from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
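The abstract describes two components: a hierarchy of figure-, panel-, and patch-level image-text pairs, and a pretraining objective that treats those granularities differently. A minimal sketch of that idea is given below, assuming a CLIP-style symmetric contrastive loss whose per-pair terms are weighted by granularity; the names (HierPair, granularity_weighted_clip_loss), the weight values, and the loss form are illustrative assumptions, not the paper's actual recipe.
```python
# Minimal, illustrative sketch only: the class, function, and weight values below
# are assumptions for exposition, not the Panel2Patch authors' implementation.
from dataclasses import dataclass

import torch
import torch.nn.functional as F


@dataclass
class HierPair:
    """One vision-language pair mined at a given granularity."""
    image: torch.Tensor  # cropped figure, panel, or patch (C, H, W)
    text: str            # full caption, sub-caption, or region-focused phrase
    level: str           # "figure" | "panel" | "patch"


# Hypothetical per-level weights: coarse didactic captions contribute less,
# fine region-focused phrases contribute more to the contrastive objective.
LEVEL_WEIGHT = {"figure": 0.5, "panel": 1.0, "patch": 1.5}


def granularity_weighted_clip_loss(img_emb, txt_emb, levels, temperature=0.07):
    """Symmetric InfoNCE over a batch of (image, text) embeddings,
    with each pair's loss term scaled by its granularity weight."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    per_pair = 0.5 * (
        F.cross_entropy(logits, targets, reduction="none")
        + F.cross_entropy(logits.t(), targets, reduction="none")
    )
    weights = torch.tensor([LEVEL_WEIGHT[l] for l in levels], device=logits.device)
    return (weights * per_pair).sum() / weights.sum()


if __name__ == "__main__":
    # Random embeddings stand in for the outputs of image and text encoders.
    batch, dim = 8, 256
    levels = ["figure", "panel", "patch", "panel"] * 2
    loss = granularity_weighted_clip_loss(torch.randn(batch, dim),
                                          torch.randn(batch, dim), levels)
    print(f"granularity-weighted loss: {loss.item():.4f}")
```
Weighting rather than separating the levels keeps all granularities in one batch, which is one simple way to unify heterogeneous objectives; the paper may well use a different mechanism.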
Related papers
- DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation [19.307501518696622]
We propose a prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank.
arXiv Detail & Related papers (2025-12-11T06:03:28Z)
- A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text? [20.94974284175104]
Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paper revisits supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources.
arXiv Detail & Related papers (2025-04-07T16:13:26Z)
- BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation [9.262045402495225]
BiPVL-Seg is an end-to-end framework that integrates vision-language fusion and embedding alignment. BiPVL-Seg introduces progressive fusion in the architecture, which facilitates stage-wise information exchange between vision and text encoders. It incorporates global-local contrastive alignment, a training objective that enhances the text encoder's comprehension by aligning text and vision embeddings at both class and concept levels.
arXiv Detail & Related papers (2025-03-30T17:34:39Z)
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Can language-guided unsupervised adaptation improve medical image classification using unpaired images and texts? [14.547437214214485]
In medical image classification, supervised learning is challenging due to the scarcity of labeled medical images. We propose Medical Unsupervised Adaptation (MedUnA) of Vision-Language Models (VLMs). The LLM-generated descriptions for each class are encoded into text embeddings and matched with class labels via a cross-modal adapter.
arXiv Detail & Related papers (2024-09-03T09:25:51Z)
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating multi-modal information retrieval (MMIR) performance on image-text pairing in the scientific domain leave a notable gap.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
- Integrating curation into scientific publishing to train AI models [1.6982459897303823]
We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions.
The dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities.
We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task.
arXiv Detail & Related papers (2023-10-31T13:22:38Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing [17.96645738679543]
We show that textual semantic modelling can substantially improve contrastive learning in self-supervised vision-language processing.
We propose a self-supervised joint vision-language approach with a focus on better text modelling.
arXiv Detail & Related papers (2022-04-21T00:04:35Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.