On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation
- URL: http://arxiv.org/abs/2502.19285v2
- Date: Thu, 27 Feb 2025 09:06:34 GMT
- Title: On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation
- Authors: Ruben T. Lucassen, Tijn van de Luijtgaarden, Sander P. J. Moonemans, Gerben E. Breimer, Willeke A. M. Blokx, Mitko Veta
- Abstract summary: Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far have been trained on pathology reports that include information which cannot be inferred from paired whole slide images. We investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports.
- Score: 0.7966328552094392
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far, however, have been trained on pathology reports that include information which cannot be inferred from paired whole slide images (e.g., patient history), potentially leading to hallucinated sentences in generated reports. To this end, we investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports. More concretely, we compare a model trained on full reports against a model trained on preprocessed reports that only include sentences describing the cell and tissue appearances based on the H&E-stained slides. For the experiments, we built upon the BLIP-2 framework and used a cutaneous melanocytic lesion dataset of 42,433 H&E-stained whole slide images and 19,636 corresponding pathology reports. Model performance was assessed using image-to-text and text-to-image retrieval, as well as qualitative evaluation of the generated reports by an expert pathologist. Our results demonstrate that text preprocessing prevents hallucination in report generation. Despite the improvement in the quality of the generated reports, training the vision-language model on full reports showed better cross-modal retrieval performance.
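The abstract evaluates the models with image-to-text and text-to-image retrieval but does not name the metric. As a rough illustration only, the sketch below assumes Recall@K, a common choice for cross-modal retrieval, computed from a query-by-candidate similarity matrix. The function name `recall_at_k`, the toy scores, and the ground-truth sets are all hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant item in the top-k candidates.

    similarity[i][j] is the score between query i and candidate j (higher is better).
    relevant[i] is the set of candidate indices that count as correct for query i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not similarity:
        raise ValueError("no queries")
    if len(similarity) != len(relevant):
        raise ValueError("need one relevance set per query")
    hits = 0
    for scores, targets in zip(similarity, relevant):
        # Rank candidate indices by descending score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if targets.intersection(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example (made-up scores): 3 image queries scored against 3 report candidates.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
truth = [{0}, {1}, {2}]  # assumed ground truth: image i pairs with report i

r1 = recall_at_k(sim, truth, k=1)
r2 = recall_at_k(sim, truth, k=2)

# Text-to-image retrieval reuses the same function on the transposed matrix.
sim_t = [list(col) for col in zip(*sim)]
r1_t = recall_at_k(sim_t, truth, k=1)
```

Using sets for the ground truth lets one query have several correct matches, which matters when a single report is paired with more than one whole slide image.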
Related papers
- Causal Disentanglement for Robust Long-tail Medical Image Generation [80.15257897500578]
We propose a novel medical image generation framework, which generates independent pathological and structural features.
We leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images.
arXiv Detail & Related papers (2025-04-20T01:54:18Z)
- Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report Generation [54.631356899598956]
We propose a novel associative memory-enhanced X-ray report generation model that effectively mimics the process of professional doctors writing medical reports. We employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information.
arXiv Detail & Related papers (2025-01-07T01:19:48Z)
- Clinical-grade Multi-Organ Pathology Report Generation for Multi-scale Whole Slide Images via a Semantically Guided Medical Text Foundation Model [3.356716093747221]
We propose a novel Patient-level Multi-organ Pathology Report Generation (PMPRG) model to generate pathology reports for patients.
Our model achieved a METEOR score of 0.68, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-23T22:22:32Z)
- Contrastive Learning with Counterfactual Explanations for Radiology Report Generation [83.30609465252441]
We propose a CounterFactual Explanations-based framework (CoFE) for radiology report generation.
Counterfactual explanations serve as a potent tool for understanding how decisions made by algorithms can be changed by asking "what if" scenarios.
Experiments on two benchmarks demonstrate that leveraging the counterfactual explanations enables CoFE to generate semantically coherent and factually complete reports.
arXiv Detail & Related papers (2024-07-19T17:24:25Z)
- Application Of Vision-Language Models For Assessing Osteoarthritis Disease Severity [0.43431539537721414]
Osteoarthritis (OA) poses a global health challenge, demanding precise diagnostic methods.
Existing deep learning models for OA assessment are unimodal, single-task systems.
This study investigates employing Vision Language Processing models to predict OA severity using X-ray images and corresponding reports.
arXiv Detail & Related papers (2024-01-12T02:43:58Z)
- WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images [5.960501267687475]
We investigate how to generate pathology reports given whole slide images.
We curated the largest WSI-text dataset (PathText).
On the model end, we propose the multiple instance generative model (MI-Gen).
arXiv Detail & Related papers (2023-11-27T05:05:41Z)
- Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
- C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose a cross-modal consistent multi-view medical report generation method with a domain transfer network (C2M-DoT).
C2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Variational Topic Inference for Chest X-Ray Report Generation [102.04931207504173]
Report generation for medical imaging promises to reduce workload and assist diagnosis in clinical practice.
Recent work has shown that deep learning models can successfully caption natural images.
We propose variational topic inference for automatic report generation.
arXiv Detail & Related papers (2021-07-15T13:34:38Z)
- A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports [5.074841553282345]
In this study, we adopt four pre-trained V+L models to learn multimodal representation from MIMIC-CXR radiographs and associated reports.
In comparison to the pioneering CNN-RNN model, the joint embedding learned by pre-trained V+L models demonstrates a performance improvement in the thoracic findings classification task.
arXiv Detail & Related papers (2020-09-03T09:00:47Z)