ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of
Pneumothorax
- URL: http://arxiv.org/abs/2303.01615v2
- Date: Fri, 15 Sep 2023 21:48:20 GMT
- Title: ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of
Pneumothorax
- Authors: Zachary Huemann, Xin Tie, Junjie Hu, Tyler J. Bradshaw
- Abstract summary: We propose a novel vision-language model, ConTEXTual Net, for the task of pneumothorax segmentation on chest radiographs.
We trained it on the CANDID-PTX dataset consisting of 3,196 positive cases of pneumothorax.
It achieved a Dice score of 0.716$\pm$0.016, which was similar to the degree of inter-reader variability.
It outperformed both vision-only models and a competing vision-language model.
- Score: 5.168314889999992
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Radiology narrative reports often describe characteristics of a patient's
disease, including its location, size, and shape. Motivated by the recent
success of multimodal learning, we hypothesized that this descriptive text
could guide medical image analysis algorithms. We proposed a novel
vision-language model, ConTEXTual Net, for the task of pneumothorax
segmentation on chest radiographs. ConTEXTual Net utilizes language features
extracted from corresponding free-form radiology reports using a pre-trained
language model. Cross-attention modules are designed to combine the
intermediate output of each vision encoder layer and the text embeddings
generated by the language model. ConTEXTual Net was trained on the CANDID-PTX
dataset consisting of 3,196 positive cases of pneumothorax with segmentation
annotations from 6 different physicians as well as clinical radiology reports.
Using cross-validation, ConTEXTual Net achieved a Dice score of
0.716$\pm$0.016, which was similar to the degree of inter-reader variability
(0.712$\pm$0.044) computed on a subset of the data. It outperformed both
vision-only models (ResNet50 U-Net: 0.677$\pm$0.015 and GLoRIA:
0.686$\pm$0.014) and a competing vision-language model (LAVT: 0.706$\pm$0.009).
Ablation studies confirmed that it was the text information that led to the
performance gains. Additionally, we show that certain augmentation methods
degraded ConTEXTual Net's segmentation performance by breaking the image-text
concordance. We also evaluated the effects of using different language models
and activation functions in the cross-attention module, highlighting the
efficacy of our chosen architectural design.
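The core mechanism described in the abstract, cross-attention between the intermediate output of each vision-encoder layer and report-text embeddings, together with the Dice metric used for evaluation, can be illustrated with a minimal PyTorch sketch. This is an assumption-based illustration, not the authors' released code; the module name, projection sizes, number of heads, and residual fusion are illustrative choices.

```python
# Minimal sketch (assumption, not the authors' code) of cross-attention fusion:
# token embeddings from a pre-trained language model are combined with the
# intermediate output of a vision-encoder layer.
import torch
import torch.nn as nn


class TextVisionCrossAttention(nn.Module):
    """Vision features act as queries; report-text embeddings act as keys/values."""

    def __init__(self, vis_channels, text_dim, embed_dim=256, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_channels, embed_dim, kernel_size=1)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out_proj = nn.Conv2d(embed_dim, vis_channels, kernel_size=1)

    def forward(self, vis_feat, text_emb):
        # vis_feat: (B, C, H, W) intermediate feature map from one encoder stage
        # text_emb: (B, T, D) token embeddings of the radiology report
        b, c, h, w = vis_feat.shape
        q = self.vis_proj(vis_feat).flatten(2).transpose(1, 2)   # (B, H*W, E)
        kv = self.text_proj(text_emb)                            # (B, T, E)
        fused, _ = self.attn(q, kv, kv)                          # (B, H*W, E)
        fused = fused.transpose(1, 2).reshape(b, -1, h, w)       # (B, E, H, W)
        return vis_feat + self.out_proj(fused)                   # residual fusion


def dice_score(pred, target, eps=1e-6):
    """Dice coefficient on binary masks, the evaluation metric reported above."""
    pred, target = pred.float().flatten(1), target.float().flatten(1)
    inter = (pred * target).sum(dim=1)
    return ((2 * inter + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)).mean()
```

In the design described in the abstract, one such cross-attention block would be attached to each vision-encoder stage before the features are passed to the decoder.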
Related papers
- MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification [19.29480118378639]
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels.
This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification.
arXiv Detail & Related papers (2025-02-11T09:42:13Z)
- RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment [10.67889367763112]
RadAlign is a novel framework that combines the predictive accuracy of vision-language models with the reasoning capabilities of large language models.
Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI.
arXiv Detail & Related papers (2025-01-13T17:55:32Z)
- A Lesion-aware Edge-based Graph Neural Network for Predicting Language Ability in Patients with Post-stroke Aphasia [12.129896943547912]
We propose a lesion-aware graph neural network (LEGNet) to predict language ability from resting-state fMRI (rs-fMRI) connectivity in patients with post-stroke aphasia.
Our model integrates three components: an edge-based learning module that encodes functional connectivity between brain regions, a lesion encoding module, and a subgraph learning module.
arXiv Detail & Related papers (2024-09-03T21:28:48Z)
- Contrastive Learning with Counterfactual Explanations for Radiology Report Generation [83.30609465252441]
We propose a CounterFactual Explanations-based framework (CoFE) for radiology report generation.
Counterfactual explanations serve as a potent tool for understanding how decisions made by algorithms can be changed by asking "what if" scenarios.
Experiments on two benchmarks demonstrate that leveraging the counterfactual explanations enables CoFE to generate semantically coherent and factually complete reports.
arXiv Detail & Related papers (2024-07-19T17:24:25Z)
- Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation [42.06416052431378]
2D radiology captioning cannot capture the real-world diagnostic challenges posed by volumetric 3D anatomy.
We collected a 3D-BrainCT dataset of 18,885 text-scan pairs and applied clinical visual instruction tuning to train BrainGPT models that generate radiology-adherent 3D brain CT reports.
Our work embodies a holistic framework that showcased the first-hand experience of curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
arXiv Detail & Related papers (2024-07-02T12:58:35Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, can identify organs and abnormalities in a zero-shot manner using natural language (a generic sketch of this type of contrastive objective appears after this list).
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Medical Image Captioning via Generative Pretrained Transformers [57.308920993032274]
We combine two language models, Show-Attend-Tell and GPT-3, to generate comprehensive and descriptive radiology records.
The proposed model is tested on two medical datasets, Open-I and MIMIC-CXR, as well as the general-purpose MS-COCO dataset.
arXiv Detail & Related papers (2022-09-28T10:27:10Z)
- A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports [5.074841553282345]
In this study, we adopt four pre-trained V+L models to learn multimodal representation from MIMIC-CXR radiographs and associated reports.
In comparison to the pioneering CNN-RNN model, the joint embeddings learned by pre-trained V+L models improve performance on the thoracic findings classification task.
arXiv Detail & Related papers (2020-09-03T09:00:47Z)
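Several of the works above (the GLoRIA baseline in the abstract and the CT-GLIP entry in this list) rely on contrastive image-text pretraining to align the two modalities. The following is a generic, assumption-based sketch of the symmetric InfoNCE objective commonly used for such alignment; it is not taken from any of the cited papers, and the temperature value and tensor shapes are illustrative.

```python
# Generic sketch (assumption) of a symmetric image-text contrastive (InfoNCE) loss,
# the kind of objective used for vision-language alignment in several papers above.
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) embeddings of paired images and report snippets
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Matched pairs lie on the diagonal; pull them together, push others apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```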
This list is automatically generated from the titles and abstracts of the papers on this site.