ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of
Pneumothorax
- URL: http://arxiv.org/abs/2303.01615v2
- Date: Fri, 15 Sep 2023 21:48:20 GMT
- Title: ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of
Pneumothorax
- Authors: Zachary Huemann, Xin Tie, Junjie Hu, Tyler J. Bradshaw
- Abstract summary: We propose a novel vision-language model, ConTEXTual Net, for the task of pneumothorax segmentation on chest radiographs.
We trained it on the CANDID-PTX dataset consisting of 3,196 positive cases of pneumothorax.
It achieved a Dice score of 0.716$\pm$0.016, which was similar to the degree of inter-reader variability.
It outperformed both vision-only models and a competing vision-language model.
- Score: 5.168314889999992
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Radiology narrative reports often describe characteristics of a patient's
disease, including its location, size, and shape. Motivated by the recent
success of multimodal learning, we hypothesized that this descriptive text
could guide medical image analysis algorithms. We proposed a novel
vision-language model, ConTEXTual Net, for the task of pneumothorax
segmentation on chest radiographs. ConTEXTual Net utilizes language features
extracted from corresponding free-form radiology reports using a pre-trained
language model. Cross-attention modules are designed to combine the
intermediate output of each vision encoder layer and the text embeddings
generated by the language model. ConTEXTual Net was trained on the CANDID-PTX
dataset consisting of 3,196 positive cases of pneumothorax with segmentation
annotations from 6 different physicians as well as clinical radiology reports.
Using cross-validation, ConTEXTual Net achieved a Dice score of
0.716$\pm$0.016, which was similar to the degree of inter-reader variability
(0.712$\pm$0.044) computed on a subset of the data. It outperformed both
vision-only models (ResNet50 U-Net: 0.677$\pm$0.015 and GLoRIA:
0.686$\pm$0.014) and a competing vision-language model (LAVT: 0.706$\pm$0.009).
Ablation studies confirmed that it was the text information that led to the
performance gains. Additionally, we show that certain augmentation methods
degraded ConTEXTual Net's segmentation performance by breaking the image-text
concordance. We also evaluated the effects of using different language models
and activation functions in the cross-attention module, highlighting the
efficacy of our chosen architectural design.
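The two ideas in the abstract that lend themselves to a small worked example are the Dice score used for evaluation and the remark that some augmentations break image-text concordance. The sketch below is illustrative only: the function names, the toy masks, and the choice of horizontal flipping as the augmentation are assumptions made here, not details taken from the paper. It computes Dice as twice the overlap divided by the total mask area, then shows that mirroring a one-sided mask leaves no overlap with the original, which is the kind of mismatch that could arise if an image were flipped while its report still named the original side.

```python
# Illustrative sketch only: names and toy data are hypothetical, not from the paper.

def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for binary masks given as lists of rows."""
    intersection = sum(
        p & t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def horizontal_flip(mask):
    """Mirror a mask left-to-right, as a horizontal-flip augmentation would."""
    return [row[::-1] for row in mask]


# Toy 4x6 reference mask with the finding confined to one side of the image.
reference = [
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
# A prediction that misses a single pixel of the reference.
prediction = [
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

overlap = dice_score(prediction, reference)
flipped_overlap = dice_score(horizontal_flip(reference), reference)
print(f"Dice(prediction, reference) = {overlap:.3f}")
print(f"Dice(flipped reference, reference) = {flipped_overlap:.3f}")
```

With these toy masks the near-miss prediction scores 8/9, while the mirrored mask scores 0 against the original. A text-guided model trained on flipped images paired with unflipped reports would be pushed toward the wrong side in the same way.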
Related papers
- A Lesion-aware Edge-based Graph Neural Network for Predicting Language Ability in Patients with Post-stroke Aphasia [12.129896943547912]
We propose a lesion-aware graph neural network (LEGNet) to predict language ability from resting-state fMRI (rs-fMRI) connectivity in patients with post-stroke aphasia.
Our model integrates three components: an edge-based learning module that encodes functional connectivity between brain regions, a lesion encoding module, and a subgraph learning module.
arXiv Detail & Related papers (2024-09-03T21:28:48Z)
- Contrastive Learning with Counterfactual Explanations for Radiology Report Generation [83.30609465252441]
We propose a CounterFactual Explanations-based framework (CoFE) for radiology report generation.
Counterfactual explanations serve as a potent tool for understanding how decisions made by algorithms can be changed by asking "what if" scenarios.
Experiments on two benchmarks demonstrate that leveraging the counterfactual explanations enables CoFE to generate semantically coherent and factually complete reports.
arXiv Detail & Related papers (2024-07-19T17:24:25Z)
- CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting [0.0]
We evaluate the publicly available, state of the art, foundational vision-language models for chest X-ray interpretation.
We find that vision-language models often hallucinate with confident language, which slows down clinical interpretation.
We develop an agent-based vision-language approach for report generation using CheXagent's linear probes and BioViL-T's phrase grounding tools.
arXiv Detail & Related papers (2024-07-11T18:39:19Z)
- Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation [42.06416052431378]
2D radiology captioning cannot reflect the real-world diagnostic challenge posed by volumetric 3D anatomy.
We collected a 3D-BrainCT dataset of 18,885 text-scan pairs and applied clinical visual instruction tuning to train BrainGPT models to generate radiology-adherent 3D brain CT reports.
Our work embodies a holistic framework that showcased the first-hand experience of curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
arXiv Detail & Related papers (2024-07-02T12:58:35Z)
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, can identify organs and abnormalities in a zero-shot manner using natural language.
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
- One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts [62.55349777609194]
We aim to build up a model that can Segment Anything in radiology scans, driven by Text prompts, termed as SAT.
We build up the largest and most comprehensive segmentation dataset for training, by collecting over 22K 3D medical image scans.
We have trained SAT-Nano (110M parameters) and SAT-Pro (447M parameters) demonstrating comparable performance to 72 specialist nnU-Nets trained on each dataset/subsets.
arXiv Detail & Related papers (2023-12-28T18:16:00Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Medical Image Captioning via Generative Pretrained Transformers [57.308920993032274]
We combine two language models, the Show-Attend-Tell and the GPT-3, to generate comprehensive and descriptive radiology records.
The proposed model is tested on two medical datasets, Open-I and MIMIC-CXR, and on the general-purpose MS-COCO.
arXiv Detail & Related papers (2022-09-28T10:27:10Z)
- A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports [5.074841553282345]
In this study, we adopt four pre-trained V+L models to learn multimodal representation from MIMIC-CXR radiographs and associated reports.
In comparison to the pioneering CNN-RNN model, the joint embedding learned by pre-trained V+L models demonstrates performance improvement in the thoracic findings classification task.
arXiv Detail & Related papers (2020-09-03T09:00:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.