Related papers: On the Importance of Image Encoding in Automated Chest X-Ray Report Generation

On the Importance of Image Encoding in Automated Chest X-Ray Report Generation

URL: http://arxiv.org/abs/2211.13465v1
Date: Thu, 24 Nov 2022 08:02:52 GMT
Title: On the Importance of Image Encoding in Automated Chest X-Ray Report Generation
Authors: Otabek Nazarov, Mohammad Yaqub, Karthik Nandakumar
Abstract summary: Chest X-ray is one of the most popular medical imaging modalities due to its accessibility and effectiveness. There is a chronic shortage of well-trained radiologists who can interpret these images and diagnose the patient's condition. automated radiology report generation can be a very helpful tool in clinical practice.
Score: 4.843654097048771
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chest X-ray is one of the most popular medical imaging modalities due to its accessibility and effectiveness. However, there is a chronic shortage of well-trained radiologists who can interpret these images and diagnose the patient's condition. Therefore, automated radiology report generation can be a very helpful tool in clinical practice. A typical report generation workflow consists of two main steps: (i) encoding the image into a latent space and (ii) generating the text of the report based on the latent image embedding. Many existing report generation techniques use a standard convolutional neural network (CNN) architecture for image encoding followed by a Transformer-based decoder for medical text generation. In most cases, CNN and the decoder are trained jointly in an end-to-end fashion. In this work, we primarily focus on understanding the relative importance of encoder and decoder components. Towards this end, we analyze four different image encoding approaches: direct, fine-grained, CLIP-based, and Cluster-CLIP-based encodings in conjunction with three different decoders on the large-scale MIMIC-CXR dataset. Among these encoders, the cluster CLIP visual encoder is a novel approach that aims to generate more discriminative and explainable representations. CLIP-based encoders produce comparable results to traditional CNN-based encoders in terms of NLP metrics, while fine-grained encoding outperforms all other encoders both in terms of NLP and clinical accuracy metrics, thereby validating the importance of image encoder to effectively extract semantic information. GitHub repository: https://github.com/mudabek/encoding-cxr-report-gen

Related papers

BEFUnet: A Hybrid CNN-Transformer Architecture for Precise Medical Image Segmentation [0.0]
This paper proposes an innovative U-shaped network called BEFUnet, which enhances the fusion of body and edge information for precise medical image segmentation. The BEFUnet comprises three main modules, including a novel Local Cross-Attention Feature (LCAF) fusion module, a novel Double-Level Fusion (DLF) module, and dual-branch encoder. The LCAF module efficiently fuses edge and body features by selectively performing local cross-attention on features that are spatially close between the two modalities.
arXiv Detail & Related papers (2024-02-13T21:03:36Z)
Bidirectional Captioning for Clinically Accurate and Interpretable Models [4.355562946859011]
Vision-language pretraining has been shown to produce high-quality visual encoders which transfer efficiently to downstream computer vision tasks. In this paper, we experiment with bidirectional captioning of radiology reports as a form of pretraining and compare the quality and utility of learned embeddings with those from contrastive pretraining methods. Results show that not only does captioning pretraining yield visual encoders that are competitive with contrastive pretraining (CheXpert competition multi-label AUC of 89.4%), but also that our transformer decoder is capable of generating clinically relevant reports.
arXiv Detail & Related papers (2023-10-30T15:25:29Z)
G-CASCADE: Efficient Cascaded Graph Convolutional Decoding for 2D Medical Image Segmentation [8.550528610846456]
We propose a new graph convolution-based decoder namely, Cascaded Graph Convolutional Attention Decoder (G-CASCADE) G-CASCADE progressively refines multi-stage feature maps generated by hierarchical transformer encoders with an efficient graph convolution block. Our decoder can easily be used with other hierarchical encoders for general-purpose semantic and medical image segmentation tasks.
arXiv Detail & Related papers (2023-10-24T20:41:04Z)
Dilated-UNet: A Fast and Accurate Medical Image Segmentation Approach using a Dilated Transformer and U-Net Architecture [0.6445605125467572]
This paper introduces Dilated-UNet, which combines a Dilated Transformer block with the U-Net architecture for accurate and fast medical image segmentation. The results of our experiments show that Dilated-UNet outperforms other models on several challenging medical image segmentation datasets.
arXiv Detail & Related papers (2023-04-22T17:20:13Z)
Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models. We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words. Our framework is game-theoretic, motivated by generative adversarial networks (GANs)
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
Small Lesion Segmentation in Brain MRIs with Subpixel Embedding [105.1223735549524]
We present a method to segment MRI scans of the human brain into ischemic stroke lesion and normal tissues. We propose a neural network architecture in the form of a standard encoder-decoder where predictions are guided by a spatial expansion embedding network.
arXiv Detail & Related papers (2021-09-18T00:21:17Z)
Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed dynamic neural representational decoder (NRD) As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks. This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z)
Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation. tokenized image patches are fed into the Transformer-based U-shaped decoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
Atrous Residual Interconnected Encoder to Attention Decoder Framework for Vertebrae Segmentation via 3D Volumetric CT Images [1.8146155083014204]
This paper proposes a novel algorithm for automated vertebrae segmentation via 3D volumetric spine CT images. The proposed model is based on the structure of encoder to decoder, using layer normalization to optimize mini-batch training performance. The experimental results show that our model achieves competitive performance compared with other state-of-the-art medical semantic segmentation methods.
arXiv Detail & Related papers (2021-04-08T12:09:16Z)
TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [78.01570371790669]
Medical image segmentation is an essential prerequisite for developing healthcare systems. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard. We propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation.
arXiv Detail & Related papers (2021-02-08T16:10:50Z)
Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder. Recent work has proposed to use representations from different encoder layers for diversified levels of information. We propose layer-wise multi-view decoding, where for each decoder layer, together with the representations from the last encoder layer, which serve as a global view, those from other encoder layers are supplemented for a stereoscopic view of the source sequences.
arXiv Detail & Related papers (2020-05-16T20:00:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.