Improving Joint Learning of Chest X-Ray and Radiology Report by Word
Region Alignment
- URL: http://arxiv.org/abs/2109.01949v1
- Date: Sat, 4 Sep 2021 22:58:35 GMT
- Title: Improving Joint Learning of Chest X-Ray and Radiology Report by Word
Region Alignment
- Authors: Zhanghexuan Ji, Mohammad Abuzar Shaikh, Dana Moukheiber, Sargur
Srihari, Yifan Peng, Mingchen Gao
- Abstract summary: This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports.
The model was pre-trained on both the global image-sentence level and the local image region-word level for visual-textual matching.
- Score: 9.265044250068554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning provides an opportunity to explore unlabeled chest
X-rays and their associated free-text reports accumulated in clinical routine
without manual supervision. This paper proposes a Joint Image Text
Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray
images and their radiology reports. The model was pre-trained on both the
global image-sentence level and the local image region-word level for
visual-textual matching. Both levels are bidirectionally constrained with
Cross-Entropy-based and ranking-based Triplet Matching Losses. The region-word
matching is computed using an attention mechanism, without direct supervision
of the region-word mapping. The pre-trained multi-modal representation
learning paves the way for downstream tasks involving image and/or text
encoding. We demonstrate the quality of the learned representations through
cross-modality retrieval and multi-label classification on two datasets:
OpenI-IU and MIMIC-CXR.
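To make the matching mechanism concrete, here is a minimal PyTorch-style sketch (not the authors' code) of attention-based region-word scoring combined with a bidirectional triplet ranking loss. The feature shapes, the batch-shift negative sampling, and the margin value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def region_word_similarity(regions, words):
    """regions: (B, R, D) image-region features; words: (B, T, D) word features.
    Returns one matching score per (image, report) pair."""
    r = F.normalize(regions, dim=-1)
    w = F.normalize(words, dim=-1)
    attn = torch.softmax(r @ w.transpose(1, 2), dim=1)        # (B, R, T) region weights per word
    context = attn.transpose(1, 2) @ r                        # (B, T, D) attended region context
    return F.cosine_similarity(context, w, dim=-1).mean(dim=1)  # average word-level similarity

def triplet_matching_loss(regions, words, margin=0.2):
    """Bidirectional ranking: a paired (image, report) must outscore
    mismatched pairs formed by shifting the batch by one."""
    pos = region_word_similarity(regions, words)
    neg_text = region_word_similarity(regions, words.roll(1, dims=0))   # wrong report
    neg_image = region_word_similarity(regions.roll(1, dims=0), words)  # wrong image
    return (F.relu(margin + neg_text - pos) + F.relu(margin + neg_image - pos)).mean()
```

A Cross-Entropy matching term of the kind the abstract mentions could be added on top by treating paired and shuffled scores as logits of a binary match classifier.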
Related papers
- Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity [14.223539927549782]
We propose a novel HybridMED framework to align global-level visual representations with impression and token-level visual representations with findings.
Our framework incorporates a generation decoder that employs two proxy tasks: (1) generating the impression from images via a captioning branch, and (2) generating the findings via a summarization branch.
Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements.
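As a rough, hypothetical illustration of distilling the summarization branch into the captioning branch, one could penalize divergence between the two branches' token distributions; the temperature and the KL formulation here are assumptions, not details from the paper.

```python
import torch.nn.functional as F

def branch_distillation_loss(caption_logits, summary_logits, T=2.0):
    """caption_logits, summary_logits: (B, L, V) token logits over a shared vocabulary.
    KL-distills the summarization branch (teacher) into the captioning branch (student)."""
    student = F.log_softmax(caption_logits / T, dim=-1)
    teacher = F.softmax(summary_logits / T, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * (T * T)
```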
arXiv Detail & Related papers (2024-10-01T07:05:36Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
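The text-to-image retrieval benchmark can be scored with a standard Recall@K computation; the sketch below assumes paired report/image embeddings and is illustrative rather than the authors' evaluation code.

```python
import numpy as np

def recall_at_k(text_emb, image_emb, k=5):
    """text_emb, image_emb: (N, D) arrays where row i is a paired report/image."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ v.T                                # (N, N) text-to-image cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]       # indices of the k most similar images
    hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
    return hits.mean()
```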
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation [48.723504098917324]
We propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods.
arXiv Detail & Related papers (2023-03-28T12:42:12Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
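A minimal sketch of such a dual encoder, with text through a Transformer encoder and images through a CNN; the layer sizes and the choice of ResNet-18 are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DualEncoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=4)
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Linear(cnn.fc.in_features, d_model)  # project CNN features to d_model
        self.image_encoder = cnn

    def forward(self, token_ids, images):
        text = self.text_encoder(self.embed(token_ids))  # (B, T, d_model) token representations
        visual = self.image_encoder(images)              # (B, d_model) image representation
        return text, visual
```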
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning [24.215619918283462]
We present a novel framework for learning medical visual representations directly from paired radiology reports.
Our framework harnesses the naturally exhibited semantic correspondences between medical images and radiology reports at three different levels.
arXiv Detail & Related papers (2022-10-12T09:31:39Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
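A conceptual sketch (with assumed details) of the retrieval-based corpus construction step: pair each image with its nearest texts in a shared embedding space from any off-the-shelf encoders. The top-k value is an illustrative assumption.

```python
import numpy as np

def build_weak_pairs(image_emb, text_emb, top_k=3):
    """image_emb: (N, D), text_emb: (M, D) from off-the-shelf encoders.
    Returns weakly aligned (image_idx, text_idx) pairs."""
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = v @ t.T                                # (N, M) cross-modal similarities
    nearest = np.argsort(-sims, axis=1)[:, :top_k]
    return [(i, j) for i in range(len(v)) for j in nearest[i]]
```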
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
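In the spirit of region-level pretraining, the following is a simplified CLIP-style contrastive loss over region/text pairs; it sketches the general technique, not RegionCLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive(region_feats, text_feats, temperature=0.07):
    """region_feats, text_feats: (N, D); row i pairs a region with its description."""
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = r @ t.T / temperature                       # (N, N) similarity logits
    targets = torch.arange(len(r), device=logits.device)
    # symmetric InfoNCE: region-to-text plus text-to-region
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```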
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- Multi-Label Generalized Zero Shot Learning for the Classification of Disease in Chest Radiographs [0.7734726150561088]
We propose a zero shot learning network that can simultaneously predict multiple seen and unseen diseases in chest X-ray images.
The network is end-to-end trainable and requires no independent pre-training for the offline feature extractor.
Our network outperforms two strong baselines in terms of recall, precision, f1 score, and area under the receiver operating characteristic curve.
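For intuition, the standard generalized zero-shot recipe scores every disease, seen or unseen, by comparing projected image features against class semantic embeddings; this generic sketch may differ from the paper's exact model.

```python
import torch
import torch.nn.functional as F

def disease_scores(image_feats, class_emb, projector):
    """image_feats: (B, D); class_emb: (C, S) covering seen and unseen diseases;
    projector: any nn.Module mapping D -> S."""
    z = F.normalize(projector(image_feats), dim=-1)  # (B, S) projected image features
    c = F.normalize(class_emb, dim=-1)               # (C, S) class semantic embeddings
    return torch.sigmoid(z @ c.T)                    # (B, C) independent per-disease probabilities
```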
arXiv Detail & Related papers (2021-07-14T09:04:20Z)
- Weakly-Supervised Segmentation for Disease Localization in Chest X-Ray Images [0.0]
We propose a novel approach to the semantic segmentation of medical chest X-ray images with only image-level class labels as supervision.
We show that this approach is applicable to chest X-rays for detecting an anomalous volume of air between the lung and the chest wall.
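One common recipe for localization from image-level labels is class activation mapping (CAM); the sketch below shows that generic technique as an assumption, since the paper's exact method may differ.

```python
import torch
import torch.nn.functional as F

def class_activation_map(conv_feats, fc_weight, class_idx, out_size=(224, 224)):
    """conv_feats: (B, C, H, W) final conv features; fc_weight: (num_classes, C)
    from the classifier head. Returns a normalized (B, 1, H', W') heatmap."""
    w = fc_weight[class_idx].view(1, -1, 1, 1)          # (1, C, 1, 1) class weights
    cam = (conv_feats * w).sum(dim=1, keepdim=True)     # (B, 1, H, W) weighted sum
    cam = F.interpolate(cam, size=out_size, mode="bilinear", align_corners=False)
    cam = cam.clamp(min=0)                              # keep positive evidence only
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-6)  # scale to [0, 1]
```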
arXiv Detail & Related papers (2020-07-01T20:48:35Z)
- A Novel Attention-based Aggregation Function to Combine Vision and Language [55.7633883960205]
We propose a novel fully-attentive reduction method for vision and language.
Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention.
We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices.
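A minimal sketch of such a score-based reduction: each element of one modality is scored against a summary of the other modality, and the sequence collapses to a single weighted-sum vector. The scoring network here is a simplified stand-in for the paper's cross-attention variant.

```python
import torch
import torch.nn as nn

class AttentiveReduce(nn.Module):
    """Collapse one modality's sequence into a single vector, scoring each
    element against a summary vector of the other modality."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, elems, other_summary):
        """elems: (B, N, D); other_summary: (B, D)."""
        ctx = other_summary.unsqueeze(1).expand_as(elems)      # (B, N, D) broadcast context
        scores = self.score(torch.cat([elems, ctx], dim=-1))   # (B, N, 1) per-element scores
        weights = torch.softmax(scores, dim=1)
        return (weights * elems).sum(dim=1)                    # (B, D) reduced representation
```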
arXiv Detail & Related papers (2020-04-27T18:09:46Z)