MPath: Multimodal Pathology Report Generation from Whole Slide Images
- URL: http://arxiv.org/abs/2512.11906v1
- Date: Wed, 10 Dec 2025 17:54:38 GMT
- Title: MPath: Multimodal Pathology Report Generation from Whole Slide Images
- Authors: Noorul Wahab, Nasir Rajpoot
- Abstract summary: We introduce MPath, a lightweight framework that conditions a pretrained biomedical language model (BioBART) on WSI-derived visual embeddings. MPath was developed and evaluated on the RED 2025 Grand Challenge dataset and ranked 4th in Test Phase 2, despite limited submission opportunities. The results highlight the potential of prompt-based multimodal conditioning as a scalable and interpretable strategy for pathology report generation.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automated generation of diagnostic pathology reports directly from whole slide images (WSIs) is an emerging direction in computational pathology. Translating high-resolution tissue patterns into clinically coherent text remains difficult due to large morphological variability and the complex structure of pathology narratives. We introduce MPath, a lightweight multimodal framework that conditions a pretrained biomedical language model (BioBART) on WSI-derived visual embeddings through a learned visual-prefix prompting mechanism. Instead of end-to-end vision-language pretraining, MPath leverages foundation-model WSI features (CONCH + Titan) and injects them into BioBART via a compact projection module, keeping the language backbone frozen for stability and data efficiency. MPath was developed and evaluated on the RED 2025 Grand Challenge dataset and ranked 4th in Test Phase 2, despite limited submission opportunities. The results highlight the potential of prompt-based multimodal conditioning as a scalable and interpretable strategy for pathology report generation.
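The visual-prefix prompting mechanism described in the abstract can be sketched as follows. This is a minimal, illustrative reconstruction, not the paper's implementation: the embedding dimensions, number of prefix tokens, and the single-linear-layer projection are all assumptions. The idea is that a slide-level embedding from a WSI foundation model is projected into a short sequence of "prefix" token embeddings and prepended to the text-token embeddings, while the language backbone (here only represented by its input space) stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

D_WSI = 768      # assumed dim of the WSI foundation-model embedding
D_LM = 1024      # assumed hidden size of the language model (e.g. BioBART)
N_PREFIX = 8     # assumed number of learned prefix tokens

# Learned projection module (the only trainable part in this sketch):
# maps one slide embedding to N_PREFIX language-model-sized vectors.
W = rng.standard_normal((D_WSI, N_PREFIX * D_LM)) * 0.02
b = np.zeros(N_PREFIX * D_LM)

def visual_prefix(wsi_embedding: np.ndarray) -> np.ndarray:
    """Project a (D_WSI,) slide embedding to (N_PREFIX, D_LM) prefix tokens."""
    return (wsi_embedding @ W + b).reshape(N_PREFIX, D_LM)

# Toy inputs: one slide embedding and a 12-token text embedding sequence.
slide = rng.standard_normal(D_WSI)
text_tokens = rng.standard_normal((12, D_LM))

prefix = visual_prefix(slide)
# The frozen language model would receive this concatenated sequence.
conditioned_input = np.concatenate([prefix, text_tokens], axis=0)
print(conditioned_input.shape)
```

Because only the projection is trained, the number of learnable parameters is small (here D_WSI x N_PREFIX x D_LM), which is consistent with the data-efficiency and stability argument in the abstract.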
Related papers
- A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis. CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z) - From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature [86.7745150269054]
We introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels. We develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases.
arXiv Detail & Related papers (2025-12-02T09:37:51Z) - HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology [7.982889842329205]
HiFusion is a novel deep learning framework that integrates two complementary components. We show that HiFusion achieves state-of-the-art performance across both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios. These results underscore HiFusion's potential as a robust, accurate, and scalable solution for ST inference from routine histopathology.
arXiv Detail & Related papers (2025-11-17T04:47:39Z) - PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction [2.638791169659607]
Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data. We propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction.
arXiv Detail & Related papers (2025-09-24T11:37:52Z) - BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA [5.840467499436581]
We propose BioD2C: a novel Dual-level Semantic Consistency Constraint Framework for Biomedical VQA. BioD2C achieves dual-level semantic interaction alignment at both the model and feature levels, enabling the model to adaptively learn visual features based on the question. In this work, we establish a new dataset, BioVGQ, to address inherent biases in prior datasets by filtering manually altered images and aligning question-answer pairs with multimodal context.
arXiv Detail & Related papers (2025-03-04T10:39:42Z) - A Multimodal Approach Combining Structural and Cross-domain Textual Guidance for Weakly Supervised OCT Segmentation [12.948027961485536]
We propose a novel Weakly Supervised Semantic (WSSS) approach that integrates structural guidance with text-driven strategies to generate high-quality pseudo labels.
Our method achieves state-of-the-art performance, highlighting its potential to improve diagnostic accuracy and efficiency in medical imaging.
arXiv Detail & Related papers (2024-11-19T16:20:27Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - PathAlign: A vision-language model for whole slide images in histopathology [13.567674461880905]
We develop a vision-language model based on the BLIP-2 framework using WSIs and curated text from pathology reports.
This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest.
We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization.
arXiv Detail & Related papers (2024-06-27T23:43:36Z) - Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z) - Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z) - WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images [5.960501267687475]
We investigate how to generate pathology reports given whole slide images.
We curated the largest WSI-text dataset (PathText)
On the model end, we propose the multiple instance generative model (MI-Gen)
arXiv Detail & Related papers (2023-11-27T05:05:41Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.