Bidirectional Captioning for Clinically Accurate and Interpretable
Models
- URL: http://arxiv.org/abs/2310.19635v1
- Date: Mon, 30 Oct 2023 15:25:29 GMT
- Title: Bidirectional Captioning for Clinically Accurate and Interpretable
Models
- Authors: Keegan Quigley, Miriam Cha, Josh Barua, Geeticka Chauhan, Seth
Berkowitz, Steven Horng, Polina Golland
- Abstract summary: Vision-language pretraining has been shown to produce high-quality visual encoders which transfer efficiently to downstream computer vision tasks.
In this paper, we experiment with bidirectional captioning of radiology reports as a form of pretraining and compare the quality and utility of learned embeddings with those from contrastive pretraining methods.
Results show that not only does captioning pretraining yield visual encoders that are competitive with contrastive pretraining (CheXpert competition multi-label AUC of 89.4%), but also that our transformer decoder is capable of generating clinically relevant reports.
- Score: 4.355562946859011
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language pretraining has been shown to produce high-quality visual
encoders which transfer efficiently to downstream computer vision tasks. While
generative language models have gained widespread attention, image captioning
has thus far been mostly overlooked as a form of cross-modal pretraining in
favor of contrastive learning, especially in medical image analysis. In this
paper, we experiment with bidirectional captioning of radiology reports as a
form of pretraining and compare the quality and utility of learned embeddings
with those from contrastive pretraining methods. We optimize a CNN encoder,
transformer decoder architecture named RadTex for the radiology domain. Results
show that not only does captioning pretraining yield visual encoders that are
competitive with contrastive pretraining (CheXpert competition multi-label AUC
of 89.4%), but also that our transformer decoder is capable of generating
clinically relevant reports (captioning macro-F1 score of 0.349 using CheXpert
labeler) and responding to prompts with targeted, interactive outputs.
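As a concrete illustration of the pretraining recipe described above, here is a minimal sketch of bidirectional captioning with a CNN encoder and a single transformer decoder that generates the report both left-to-right and right-to-left. This is not the released RadTex code; the layer sizes, the learned direction token, and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalCaptioner(nn.Module):
    """Sketch: CNN encoder -> transformer decoder, trained to caption
    the report in both left-to-right and right-to-left order."""

    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        # Tiny CNN stand-in for the image encoder.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
        # One learned start token per decoding direction (an assumption).
        self.dir_tok = nn.Embedding(2, d_model)

    def caption_loss(self, memory, tokens, direction):
        # Prepend the direction embedding so one decoder serves both orders.
        start = self.dir_tok.weight[direction].expand(tokens.size(0), 1, -1)
        tgt_in = torch.cat([start, self.embed(tokens[:, :-1])], dim=1)
        T = tgt_in.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt_in, memory, tgt_mask=causal)
        logits = self.lm_head(out)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens.reshape(-1))

    def forward(self, images, tokens):
        feats = self.cnn(images)                   # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C) image "tokens"
        return (self.caption_loss(memory, tokens, 0) +          # left-to-right
                self.caption_loss(memory, tokens.flip(1), 1))   # right-to-left

model = BidirectionalCaptioner()
loss = model(torch.randn(2, 1, 224, 224), torch.randint(0, 10000, (2, 32)))
loss.backward()
```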
Related papers
- Image Captioners Are Scalable Vision Learners Too [61.98796478791261]
Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones.
Our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.
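For reference, the contrastive pretraining that captioning is compared against pulls matched image-text pairs together and pushes mismatched pairs apart. A minimal CLIP-style symmetric InfoNCE loss, with an assumed temperature:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embeddings."""
    img = F.normalize(img_emb, dim=-1)       # (B, D) unit vectors
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature     # (B, B) scaled cosine similarities
    targets = torch.arange(img.size(0))      # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```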
arXiv Detail & Related papers (2023-06-13T17:18:01Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
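A rough sketch of that projection step as the abstract describes it: the image embedding is replaced by a similarity-weighted average over a memory of cached caption embeddings, so the text-only-trained decoder never receives an out-of-distribution input. The temperature and names are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def project_to_text_space(img_emb, text_memory, tau=0.01):
    """Map a CLIP image embedding into the CLIP *text* embedding space
    as an attention-weighted average of cached text embeddings."""
    img = F.normalize(img_emb, dim=-1)             # (D,)
    mem = F.normalize(text_memory, dim=-1)         # (N, D) cached caption embeddings
    weights = F.softmax(mem @ img / tau, dim=0)    # (N,) attention over the memory
    proj = weights @ mem                           # (D,) lies in text space
    return F.normalize(proj, dim=-1)

memory = torch.randn(1000, 512)   # embeddings of training captions
image = torch.randn(512)          # CLIP image embedding
text_like = project_to_text_space(image, memory)
```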
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that models built on VPD can be adapted to downstream visual perception tasks faster.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- On the Importance of Image Encoding in Automated Chest X-Ray Report Generation [4.843654097048771]
Chest X-ray is one of the most popular medical imaging modalities due to its accessibility and effectiveness.
However, there is a chronic shortage of well-trained radiologists who can interpret these images and diagnose the patient's condition.
Automated radiology report generation can therefore be a very helpful tool in clinical practice.
arXiv Detail & Related papers (2022-11-24T08:02:52Z)
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
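A generic masked-image-modeling sketch in the spirit of that encoder pretraining: random patches are swapped for a learned mask token and the encoder is trained to reconstruct their pixels. MaskOCR's actual architecture and masking scheme differ in detail.

```python
import torch
import torch.nn as nn

class MaskedPatchPretrainer(nn.Module):
    """Hide random patches of a text-line image; train an encoder
    to reconstruct the pixels of the hidden patches."""

    def __init__(self, patch_dim=16 * 16, d_model=256, n_patches=64):
        super().__init__()
        self.to_tokens = nn.Linear(patch_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.pos = nn.Parameter(torch.zeros(n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_pixels = nn.Linear(d_model, patch_dim)

    def forward(self, patches, mask_ratio=0.6):
        B, N, _ = patches.shape
        tokens = self.to_tokens(patches)
        mask = torch.rand(B, N) < mask_ratio          # True = hidden patch
        tokens[mask] = self.mask_token                # swap in the learned token
        recon = self.to_pixels(self.encoder(tokens + self.pos))
        return ((recon - patches)[mask] ** 2).mean()  # loss on masked patches only

model = MaskedPatchPretrainer()
loss = model(torch.randn(4, 64, 256))  # 4 images, 64 patches of 16x16 pixels
loss.backward()
```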
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
- Understanding Transfer Learning for Chest Radiograph Clinical Report Generation with Modified Transformer Architectures [0.0]
We train a series of modified transformers to generate clinical reports from chest radiograph image input.
We use BLEU(1-4), ROUGE-L, CIDEr, and the clinical CheXbert F1 scores to validate our models and demonstrate competitive scores with state-of-the-art models.
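Of these metrics, BLEU(1-4) is simple to reproduce with NLTK, as in the sketch below; the example reports are invented, and ROUGE-L, CIDEr, and CheXbert F1 need separate tooling.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical generated and reference reports, pre-tokenized.
reference = ["no", "acute", "cardiopulmonary", "abnormality"]
generated = ["no", "acute", "cardiopulmonary", "process"]

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights up to n-grams
    score = sentence_bleu([reference], generated,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```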
arXiv Detail & Related papers (2022-05-05T03:08:05Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
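A toy version of the pseudo-language induction: cluster per-frame audio features into discrete units and collapse repeated runs (Wav2Seq compresses the result further, e.g. with BPE). The feature choice and cluster count here are placeholders.

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

def make_pseudo_tokens(frame_features, n_units=25):
    """Turn per-frame audio features into a compact discrete 'pseudo language':
    cluster frames into units, then collapse consecutive repeats."""
    km = KMeans(n_clusters=n_units, n_init=10).fit(frame_features)
    unit_ids = km.predict(frame_features)     # one discrete unit per frame
    return [u for u, _ in groupby(unit_ids)]  # deduplicate repeated runs

# Fake utterance: 200 frames of 39-dim features (e.g., MFCCs).
features = np.random.randn(200, 39)
pseudo = make_pseudo_tokens(features)
print(len(pseudo), pseudo[:10])  # target sequence for pseudo-ASR pretraining
```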
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Visual-aware Attention Dual-stream Decoder for Video Captioning [12.139806877591212]
The attention mechanism in current video captioning methods learns to assign a weight to each frame, guiding the decoder dynamically.
However, this may not explicitly model the correlation and temporal coherence of the visual features extracted across sequential frames.
We propose a new Visual-aware Attention (VA) model, which unifies the changes across temporal frames with the words generated at the previous moment.
Experiments demonstrate the effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD).
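The general mechanism, attention over frame features conditioned on the previously generated word, can be sketched as follows; this is an illustration, not the paper's exact VA formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAwareAttention(nn.Module):
    """Score each frame against the previous word, so attention weights
    reflect both the visual sequence and the language state."""

    def __init__(self, d_frame=512, d_word=300, d_attn=256):
        super().__init__()
        self.q = nn.Linear(d_word, d_attn)   # query from the previous word
        self.k = nn.Linear(d_frame, d_attn)  # keys from frame features

    def forward(self, frames, prev_word):
        # frames: (B, T, d_frame); prev_word: (B, d_word)
        q = self.q(prev_word).unsqueeze(1)                 # (B, 1, d_attn)
        k = self.k(frames)                                 # (B, T, d_attn)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5       # (B, T)
        weights = F.softmax(scores, dim=-1)
        context = (weights.unsqueeze(-1) * frames).sum(1)  # (B, d_frame)
        return context, weights

attn = VisualAwareAttention()
ctx, w = attn(torch.randn(2, 30, 512), torch.randn(2, 300))
```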
arXiv Detail & Related papers (2021-10-16T14:08:20Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training sequence encoders.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)