Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation
- URL: http://arxiv.org/abs/2506.19665v1
- Date: Tue, 24 Jun 2025 14:29:06 GMT
- Title: Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation
- Authors: Yuanhe Tian, Lei Mao, Yan Song
- Abstract summary: We propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to obtain important visual information. Experiment results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models.
- Score: 18.113659670915474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating reports for computed tomography (CT) images is a challenging task that, while similar to other medical image report generation tasks, has its own unique characteristics, such as the spatial encoding of multiple images and the alignment between image volumes and text. Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume: they first compress the volume and then divide the compressed CT slices into patches for visual encoding. These approaches do not explicitly account for the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions, to instruct CT report generation (CTRG). Considering the strong correlation among consecutive slices in CT scans, we propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to selectively obtain important visual information and align it with textual features, so as to better instruct an LLM for CTRG. Experimental results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models and achieves state-of-the-art results, demonstrating its validity and effectiveness.
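The abstract gives no architectural details, so the following is only a toy sketch of the general idea it describes: carrying state from one CT slice to the next, then attention-pooling the per-slice features. Every name here (`encode_slice`, `recurrent_encode`, `attention_pool`), the blending recurrence, and the single query vector are assumptions made for illustration; this is not the authors' model, which uses a vision Transformer and several attentions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_slice(slice_features, prev_state, mix=0.5):
    """Hypothetical slice encoder: blend the current slice with the state
    carried over from the previous slice (stand-in for a real ViT)."""
    return [mix * x + (1.0 - mix) * h for x, h in zip(slice_features, prev_state)]


def recurrent_encode(volume, mix=0.5):
    """Process slices in order so each encoding depends on earlier slices."""
    state = [0.0] * len(volume[0])
    encoded = []
    for slice_features in volume:
        state = encode_slice(slice_features, state, mix)
        encoded.append(state)
    return encoded


def attention_pool(encoded, query):
    """Scaled dot-product attention of one query over the encoded slices."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
        for feat in encoded
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, encoded))
        for i in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    # Three slices, each reduced to a 2-dimensional feature vector.
    volume = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    encoded = recurrent_encode(volume)
    pooled, weights = attention_pool(encoded, query=[1.0, 1.0])
    print(encoded)
    print(weights)
    print(pooled)
```

In this sketch the attention weights sum to one, and the slice whose encoding aligns best with the query receives the largest weight. A real system would replace the query with learned or text-derived vectors.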
Related papers
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
- Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography [10.110878689623961]
We introduce CT-RATE, the first dataset that pairs 3D medical images with corresponding textual reports. We develop CT-CLIP, a CT-focused contrastive language-image pretraining framework. We create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes.
arXiv Detail & Related papers (2024-03-26T16:19:56Z)
- Enhancing CT Image synthesis from multi-modal MRI data based on a multi-task neural network framework [16.864720020158906]
We propose a versatile multi-task neural network framework, based on an enhanced Transformer U-Net architecture.
We decompose the traditional problem of synthesizing CT images into distinct subtasks.
To enhance the framework's versatility in handling multi-modal data, we expand the model with multiple image channels.
arXiv Detail & Related papers (2023-12-13T18:22:38Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- Multi-View Vertebra Localization and Identification from CT Images [57.56509107412658]
We propose a multi-view method for vertebra localization and identification from CT images.
We convert the 3D problem into a 2D localization and identification task on different views.
Our method can learn the multi-view global information naturally.
arXiv Detail & Related papers (2023-07-24T14:43:07Z)
- Attentive Symmetric Autoencoder for Brain MRI Segmentation [56.02577247523737]
We propose a novel Attentive Symmetric Auto-encoder based on Vision Transformer (ViT) for 3D brain MRI segmentation tasks.
In the pre-training stage, the proposed auto-encoder pays more attention to reconstructing the informative patches according to the gradient metrics.
Experimental results show that our proposed attentive symmetric auto-encoder outperforms the state-of-the-art self-supervised learning methods and medical image segmentation models.
arXiv Detail & Related papers (2022-09-19T09:43:19Z)
- Self-supervised 3D anatomy segmentation using self-distilled masked image transformer (SMIT) [2.7298989068857487]
Self-supervised learning has demonstrated success in medical image segmentation using convolutional networks.
We show our approach is more accurate and requires fewer fine-tuning datasets than other pretext tasks.
arXiv Detail & Related papers (2022-05-20T17:55:14Z)
- Two-Stream Graph Convolutional Network for Intra-oral Scanner Image Segmentation [133.02190910009384]
We propose a two-stream graph convolutional network (i.e., TSGCN) to handle inter-view confusion between different raw attributes.
Our TSGCN significantly outperforms state-of-the-art methods in 3D tooth (surface) segmentation.
arXiv Detail & Related papers (2022-04-19T10:41:09Z)
- Pathological Retinal Region Segmentation From OCT Images Using Geometric Relation Based Augmentation [84.7571086566595]
We propose improvements over previous GAN-based medical image synthesis methods by jointly encoding the intrinsic relationship of geometry and shape.
The proposed method outperforms state-of-the-art segmentation methods on the public RETOUCH dataset having images captured from different acquisition procedures.
arXiv Detail & Related papers (2020-03-31T11:50:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.