Intensive Vision-guided Network for Radiology Report Generation
- URL: http://arxiv.org/abs/2402.03754v1
- Date: Tue, 6 Feb 2024 06:46:46 GMT
- Title: Intensive Vision-guided Network for Radiology Report Generation
- Authors: Fudan Zheng, Mengfei Li, Ying Wang, Weijiang Yu, Ruixuan Wang,
Zhiguang Chen, Nong Xiao, and Yutong Lu
- Abstract summary: We propose a Globally-intensive Attention (GIA) module in the medical image encoder to simulate and integrate multi-view vision perception.
We also explore how to involve multi-modal signals to generate precisely matched reports, i.e., how to integrate previously predicted words with region-aware visual content in next word prediction.
- Score: 22.030289124516326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic radiology report generation is booming due to its huge application
potential for the healthcare industry. However, existing computer vision and
natural language processing approaches to tackle this problem are limited in
two aspects. First, when extracting image features, most of them neglect
multi-view reasoning and model only a single-view structure of medical
images, such as the space view or the channel view, whereas clinicians rely on
multi-view imaging information for comprehensive judgment in daily clinical
diagnosis. Second, when generating reports, they overlook context reasoning
with multi-modal information and focus on pure textual optimization utilizing
retrieval-based methods. We aim to address these two issues by proposing a
model that better simulates clinicians' perspectives and generates more
accurate reports. Given the above limitation in feature extraction, we propose
a Globally-intensive Attention (GIA) module in the medical image encoder to
simulate and integrate multi-view vision perception. GIA aims to learn three
types of vision perception: depth view, space view, and pixel view. On the
other hand, to address the above problem in report generation, we explore how
to involve multi-modal signals to generate precisely matched reports, i.e., how
to integrate previously predicted words with region-aware visual content in
next word prediction. Specifically, we design a Visual Knowledge-guided Decoder
(VKGD), which can adaptively consider how much the model needs to rely on
visual information and previously predicted text to assist next word
prediction. Hence, our final Intensive Vision-guided Network (IVGN) framework
includes a GIA-guided Visual Encoder and the VKGD. Experiments on two
commonly used datasets, IU X-Ray and MIMIC-CXR, demonstrate the superior
ability of our method compared with other state-of-the-art approaches.
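The two components described above amount to (1) a multi-view attention block over the encoder's feature maps (depth, space and pixel views) and (2) a gated decoding step that balances visual and textual evidence when predicting the next word. The following PyTorch sketch illustrates these two ideas as inferred from the abstract alone; all module names, tensor shapes and the gating formulation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multi-view attention and a gated decoding step,
# written from the abstract only; not the paper's actual code.
import torch
import torch.nn as nn


class MultiViewAttention(nn.Module):
    """Fuses three attention views over a feature map x of shape (B, C, H, W)."""

    def __init__(self, channels: int):
        super().__init__()
        # Depth (channel) view: re-weight feature channels.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        # Space view: attend over spatial positions from pooled statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Pixel view: element-wise refinement.
        self.pixel_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        ch_w = self.channel_fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        sp_in = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        sp_w = torch.sigmoid(self.spatial_conv(sp_in))
        px_w = torch.sigmoid(self.pixel_conv(x))
        return x * ch_w * sp_w * px_w


class GatedDecoderStep(nn.Module):
    """One decoding step that adaptively mixes visual and textual context."""

    def __init__(self, dim: int, vocab: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, visual_ctx: torch.Tensor, text_ctx: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides how much to rely on visual vs. textual evidence.
        g = torch.sigmoid(self.gate(torch.cat([visual_ctx, text_ctx], dim=-1)))
        fused = g * visual_ctx + (1 - g) * text_ctx
        return self.out(fused)  # next-word logits


if __name__ == "__main__":
    feats = MultiViewAttention(256)(torch.randn(2, 256, 16, 16))
    logits = GatedDecoderStep(512, 30522)(torch.randn(2, 512), torch.randn(2, 512))
    print(feats.shape, logits.shape)
```

Here the sigmoid gate g stands in for the adaptive reliance on visual versus previously predicted textual information that the VKGD is described as learning; the actual IVGN architecture may combine these signals differently.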
Related papers
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- DeViDe: Faceted medical knowledge for improved medical vision-language pre-training [1.6567372257085946]
Vision-language pre-training for chest X-rays has made significant strides, primarily by utilizing paired radiographs and radiology reports.
We propose DeViDe, a transformer-based method that leverages radiographic descriptions from the open web.
DeViDe incorporates three key features for knowledge-augmented vision language alignment: First, a large-language model-based augmentation is employed to homogenise medical knowledge from diverse sources.
In zero-shot settings, DeViDe performs comparably to fully supervised models on external datasets and achieves state-of-the-art results on three large-scale datasets.
arXiv Detail & Related papers (2024-04-04T17:40:06Z)
- RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision [44.00149519249467]
Language-supervised pre-training has proven to be a valuable method for extracting semantically meaningful features from images.
We introduce RAD-DINO, a biomedical image encoder pre-trained solely on unimodal biomedical imaging data.
arXiv Detail & Related papers (2024-01-19T17:02:17Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach that enhances medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as the supplementation of the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)
- AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation [50.21065317817769]
We propose an AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and the Multi-Grained Transformer (MGT) modules.
Experiments on the public IU-Xray and MIMIC-CXR datasets show that the AlignTransformer can achieve results competitive with state-of-the-art methods on the two datasets.
arXiv Detail & Related papers (2022-03-18T13:43:53Z)
- Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training [5.119201893752376]
We propose Medical Vision Language Learner (MedViLL) which adopts a Transformer-based architecture combined with a novel multimodal attention masking scheme.
We empirically demonstrate the superior downstream task performance of MedViLL against various baselines including task-specific architectures.
arXiv Detail & Related papers (2021-05-24T15:14:09Z)
- Act Like a Radiologist: Towards Reliable Multi-view Correspondence Reasoning for Mammogram Mass Detection [49.14070210387509]
We propose an Anatomy-aware Graph convolutional Network (AGN) for mammogram mass detection.
AGN is tailored for mammogram mass detection and endows existing detection methods with multi-view reasoning ability.
Experiments on two standard benchmarks reveal that AGN significantly exceeds the state-of-the-art performance.
arXiv Detail & Related papers (2021-05-21T06:48:34Z)
- Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation [107.3538598876467]
We propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns.
ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning.
arXiv Detail & Related papers (2020-06-06T01:00:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.