Bidirectional Captioning for Clinically Accurate and Interpretable
Models
- URL: http://arxiv.org/abs/2310.19635v1
- Date: Mon, 30 Oct 2023 15:25:29 GMT
- Title: Bidirectional Captioning for Clinically Accurate and Interpretable
Models
- Authors: Keegan Quigley, Miriam Cha, Josh Barua, Geeticka Chauhan, Seth
Berkowitz, Steven Horng, Polina Golland
- Abstract summary: Vision-language pretraining has been shown to produce high-quality visual encoders which transfer efficiently to downstream computer vision tasks.
In this paper, we experiment with bidirectional captioning of radiology reports as a form of pretraining and compare the quality and utility of learned embeddings with those from contrastive pretraining methods.
Results show that not only does captioning pretraining yield visual encoders that are competitive with contrastive pretraining (CheXpert competition multi-label AUC of 89.4%), but also that our transformer decoder is capable of generating clinically relevant reports.
- Score: 4.355562946859011
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language pretraining has been shown to produce high-quality visual
encoders which transfer efficiently to downstream computer vision tasks. While
generative language models have gained widespread attention, image captioning
has thus far been mostly overlooked as a form of cross-modal pretraining in
favor of contrastive learning, especially in medical image analysis. In this
paper, we experiment with bidirectional captioning of radiology reports as a
form of pretraining and compare the quality and utility of learned embeddings
with those from contrastive pretraining methods. We optimize a CNN encoder,
transformer decoder architecture named RadTex for the radiology domain. Results
show that not only does captioning pretraining yield visual encoders that are
competitive with contrastive pretraining (CheXpert competition multi-label AUC
of 89.4%), but also that our transformer decoder is capable of generating
clinically relevant reports (captioning macro-F1 score of 0.349 using CheXpert
labeler) and responding to prompts with targeted, interactive outputs.
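A minimal sketch of the captioning pretraining objective described above is given below. It assumes a PyTorch setup with a ResNet-50 encoder and a small transformer decoder as illustrative stand-ins for the RadTex configuration, and it approximates the bidirectional objective by also decoding the reversed token sequence; none of the sizes or module choices are taken from the paper.

```python
# Sketch of bidirectional captioning pretraining: a CNN encoder feeds a
# transformer decoder trained to generate the report both left-to-right and
# right-to-left. Sizes and module choices are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CaptioningPretrainer(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, num_layers=3):
        super().__init__()
        cnn = resnet50(weights=None)
        self.encoder = nn.Sequential(*list(cnn.children())[:-2])  # keep spatial map
        self.project = nn.Linear(2048, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def caption_loss(self, memory, tokens):
        # Teacher-forced next-token prediction with a causal mask.
        inp, tgt = tokens[:, :-1], tokens[:, 1:]
        mask = nn.Transformer.generate_square_subsequent_mask(inp.size(1))
        hidden = self.decoder(self.embed(inp), memory, tgt_mask=mask)
        return nn.functional.cross_entropy(
            self.lm_head(hidden).transpose(1, 2), tgt)

    def forward(self, images, tokens):
        feats = self.encoder(images)                       # (B, 2048, H, W)
        memory = self.project(feats.flatten(2).transpose(1, 2))
        forward_loss = self.caption_loss(memory, tokens)
        backward_loss = self.caption_loss(memory, tokens.flip(dims=[1]))
        return forward_loss + backward_loss                # bidirectional objective

model = CaptioningPretrainer()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 10000, (2, 32))
loss = model(images, tokens)
```

The encoder trained this way can then be detached and fine-tuned on downstream classification (e.g., CheXpert-style multi-label prediction), which is how the pretrained visual representations are compared against contrastive baselines.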
Related papers
- RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment [10.67889367763112]
RadAlign is a novel framework that combines the predictive accuracy of vision-language models with the reasoning capabilities of large language models.
Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI.
arXiv Detail & Related papers (2025-01-13T17:55:32Z)
- Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity [14.223539927549782]
We propose a novel HybridMED framework to align global-level visual representations with impression and token-level visual representations with findings.
Our framework incorporates a generation decoder that employs two proxy tasks, responsible for generating the impression from (1) images, via a captioning branch, and (2) findings, through a summarization branch.
Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements.
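A rough sketch of how such a two-branch decoder could be wired is shown below; the module sizes, teacher-forced decoding, and MSE-based distillation term are assumptions for illustration, not the HybridMED implementation.

```python
# Two-branch sketch: a captioning branch generates the impression from image
# features, a summarization branch generates it from the findings text, and a
# distillation term lets the summarization branch guide the captioning branch.
import torch
import torch.nn as nn

d_model, vocab = 256, 8000
embed = nn.Embedding(vocab, d_model)
head = nn.Linear(d_model, vocab)

def make_branch():
    layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=2)

caption_branch, summary_branch = make_branch(), make_branch()

def decode_hidden(branch, memory, impression):
    # Teacher-forced decoding of the impression with a causal mask.
    inp = impression[:, :-1]
    mask = nn.Transformer.generate_square_subsequent_mask(inp.size(1))
    return branch(embed(inp), memory, tgt_mask=mask)

image_feats = torch.randn(2, 49, d_model)      # visual tokens from the encoder
findings = torch.randint(0, vocab, (2, 40))    # findings-section tokens
impression = torch.randint(0, vocab, (2, 12))  # impression-section tokens

cap_hidden = decode_hidden(caption_branch, image_feats, impression)
sum_hidden = decode_hidden(summary_branch, embed(findings), impression)

target = impression[:, 1:]
gen_loss = (nn.functional.cross_entropy(head(cap_hidden).transpose(1, 2), target)
            + nn.functional.cross_entropy(head(sum_hidden).transpose(1, 2), target))
distill_loss = nn.functional.mse_loss(cap_hidden, sum_hidden.detach())
loss = gen_loss + distill_loss
```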
arXiv Detail & Related papers (2024-10-01T07:05:36Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only Text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
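The core idea can be sketched as follows, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the toy character vocabulary and decoder sizes are illustrative assumptions rather than DPTR's actual setup.

```python
# Sketch of the DPTR idea: text embeddings from a frozen CLIP text encoder
# stand in for visual features ("pseudo visual embeddings") so a recognition
# decoder can be pretrained from text alone.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
text_encoder.requires_grad_(False)           # frozen: only the decoder is trained

d_model = text_encoder.config.hidden_size    # 512 for this checkpoint
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
char_embed = nn.Embedding(100, d_model)      # toy character vocabulary
char_head = nn.Linear(d_model, 100)

texts = ["hello", "world"]
batch = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    pseudo_visual = text_encoder(**batch).last_hidden_state   # (B, T, d_model)

# Target character sequences (hypothetical integer encoding of the same strings).
targets = torch.nn.utils.rnn.pad_sequence(
    [torch.tensor([ord(c) % 100 for c in t]) for t in texts], batch_first=True)

hidden = decoder(char_embed(targets), pseudo_visual)
loss = nn.functional.cross_entropy(char_head(hidden).transpose(1, 2), targets)
```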
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training [0.1398098625978622]
Radiologic Contrastive Language-Image Pre-training (RadCLIP) is a vision-language foundational model that harnesses a vision-language pre-training framework to improve radiologic image analysis.
RadCLIP was pre-trained to align radiologic images with their corresponding text annotations, creating a robust vision backbone for radiologic images.
Our key contributions include curating a large dataset of diverse 2D/3D radiologic image-text pairs, a slice pooling adapter that uses an attention mechanism to integrate 2D images, and comprehensive evaluations of RadCLIP on various radiologic downstream tasks.
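One plausible reading of such an attention-based slice pooling adapter is sketched below; the learned-query design and dimensions are assumptions, not RadCLIP's published module.

```python
# Illustrative attention-pooling adapter: per-slice 2D embeddings are combined
# into a single volume-level embedding with a learned query vector.
import torch
import torch.nn as nn

class SlicePoolingAdapter(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, slice_embeddings):          # (B, num_slices, dim)
        query = self.query.expand(slice_embeddings.size(0), -1, -1)
        pooled, _ = self.attn(query, slice_embeddings, slice_embeddings)
        return pooled.squeeze(1)                  # (B, dim) volume embedding

adapter = SlicePoolingAdapter()
slices = torch.randn(4, 64, 512)    # e.g. 64 CT slices encoded by a 2D backbone
volume_embedding = adapter(slices)  # aligned with text embeddings downstream
```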
arXiv Detail & Related papers (2024-03-15T01:18:08Z)
- Automatic Report Generation for Histopathology images using pre-trained Vision Transformers [1.2781698000674653]
We show that an existing pre-trained Vision Transformer can be used in a two-step process: first to encode 4096x4096-sized patches of the Whole Slide Image (WSI), and then as the encoder, paired with an LSTM decoder, for report generation.
We are also able to use representations from an existing powerful pre-trained hierarchical vision transformer and show their usefulness not just for zero-shot classification but also for report generation.
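The two-step process might look roughly like the following sketch, which substitutes a small torchvision ViT and 224x224 patches for the paper's hierarchical transformer and 4096x4096 patches; all names and sizes are illustrative.

```python
# Step 1: a pre-trained ViT encodes WSI patches. Step 2: an LSTM decoder
# generates the report from the pooled patch embeddings.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

vit = vit_b_16(weights=None)
vit.heads = nn.Identity()            # keep the 768-d representation, drop the classifier

class ReportLSTM(nn.Module):
    def __init__(self, vocab_size=5000, feat_dim=768, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, slide_feature, tokens):
        h0 = self.init_h(slide_feature).unsqueeze(0)     # init hidden from the slide
        out, _ = self.lstm(self.embed(tokens), (h0, torch.zeros_like(h0)))
        return self.head(out)

# Step 1: encode patches cropped from the WSI (224x224 here for illustration).
patches = torch.randn(16, 3, 224, 224)
with torch.no_grad():
    patch_embeddings = vit(patches)                      # (16, 768)
slide_feature = patch_embeddings.mean(0, keepdim=True)   # (1, 768) slide summary

# Step 2: decode a report with the LSTM conditioned on the slide feature.
decoder = ReportLSTM()
tokens = torch.randint(0, 5000, (1, 20))
logits = decoder(slide_feature, tokens)
```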
arXiv Detail & Related papers (2023-11-10T16:48:24Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
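Text-to-image retrieval with embeddings in a common space can be evaluated along the following lines; the recall@k formulation here is a generic sketch rather than the paper's exact protocol.

```python
# Minimal text-to-image retrieval evaluation: for each report embedding, rank
# all image embeddings by cosine similarity and check whether the paired image
# appears in the top-k results.
import torch

def recall_at_k(image_emb, text_emb, k=5):
    image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    sims = text_emb @ image_emb.T                  # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices            # candidate images per report
    targets = torch.arange(text_emb.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Paired chest X-ray / report embeddings from any of the compared methods.
image_emb, text_emb = torch.randn(100, 512), torch.randn(100, 512)
print(recall_at_k(image_emb, text_emb, k=5))
```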
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- On the Importance of Image Encoding in Automated Chest X-Ray Report Generation [4.843654097048771]
Chest X-ray is one of the most popular medical imaging modalities due to its accessibility and effectiveness.
There is a chronic shortage of well-trained radiologists who can interpret these images and diagnose the patient's condition.
Therefore, automated radiology report generation can be a very helpful tool in clinical practice.
arXiv Detail & Related papers (2022-11-24T08:02:52Z)
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
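A minimal sketch of masked image modeling over text-image patches is given below; the patch size, mask ratio, and pixel-reconstruction target are simplifying assumptions rather than MaskOCR's exact recipe.

```python
# Masked image modeling for the encoder: random patches of a text image are
# replaced by a mask token and the encoder is trained to reconstruct them.
import torch
import torch.nn as nn

class MaskedPatchPretrainer(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, d_model=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.reconstruct = nn.Linear(d_model, patch_dim)

    def forward(self, patches, mask_ratio=0.6):
        tokens = self.embed(patches)
        mask = torch.rand(patches.shape[:2]) < mask_ratio   # which patches to hide
        tokens[mask] = self.mask_token
        pred = self.reconstruct(self.encoder(tokens))
        return nn.functional.mse_loss(pred[mask], patches[mask])

patches = torch.randn(8, 64, 16 * 16 * 3)   # unlabeled real text-image patches
loss = MaskedPatchPretrainer()(patches)
```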
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
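One plausible way to realize a modality transition module with a modality loss is sketched below; the cosine objective and dimensions are assumptions, since the abstract does not specify the loss.

```python
# Sketch of a modality transition module: visual features are mapped into the
# semantic space and pulled toward the caption's sentence embedding by a
# "modality loss" before being handed to the language model.
import torch
import torch.nn as nn

class ModalityTransitionModule(nn.Module):
    def __init__(self, visual_dim=2048, semantic_dim=768):
        super().__init__()
        self.transition = nn.Sequential(
            nn.Linear(visual_dim, semantic_dim), nn.ReLU(),
            nn.Linear(semantic_dim, semantic_dim))

    def forward(self, visual_features):
        return self.transition(visual_features)

def modality_loss(semantic_pred, sentence_embedding):
    # Encourage the transitioned visual feature to match the caption embedding.
    return 1 - nn.functional.cosine_similarity(semantic_pred, sentence_embedding).mean()

mtm = ModalityTransitionModule()
visual = torch.randn(4, 2048)       # pooled CNN features of an image
caption_emb = torch.randn(4, 768)   # embedding of the paired caption
loss = modality_loss(mtm(visual), caption_emb)
```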
arXiv Detail & Related papers (2021-02-23T07:20:12Z)