Large Language Model with Region-guided Referring and Grounding for CT Report Generation
- URL: http://arxiv.org/abs/2411.15539v1
- Date: Sat, 23 Nov 2024 12:25:06 GMT
- Title: Large Language Model with Region-guided Referring and Grounding for CT Report Generation
- Authors: Zhixuan Chen, Yequan Bie, Haibo Jin, Hao Chen
- Abstract summary: Existing methods primarily consider only the global features of the entire volume.
We propose Reg2RG, the first region-guided referring and grounding framework for CT report generation.
- Score: 4.804660464589285
- Abstract: Computed tomography (CT) report generation is crucial to assist radiologists in interpreting CT volumes, which can be time-consuming and labor-intensive. Existing methods primarily consider only the global features of the entire volume, making it difficult to focus on specific regions and potentially missing abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve local high-resolution details with little computational overhead. The local features are then integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model's referring and grounding capabilities while also improving the reports' interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from the integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code will be made publicly available.
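As a rough, hypothetical sketch of the pipeline the abstract describes, the PyTorch-style code below illustrates how per-region features pooled with segmentation masks might be combined with a global volume feature into a visual prefix for an LLM decoder. All names (e.g. RegionGuidedEncoder), shapes, and design details here are assumptions for illustration, not the released Reg2RG implementation.

```python
# Hypothetical sketch of region-guided feature fusion for report generation.
# Names and shapes are illustrative assumptions, not the Reg2RG release.
import torch
import torch.nn as nn


class RegionGuidedEncoder(nn.Module):
    """Pools per-region features with segmentation masks and fuses them
    with a global volume feature to form a visual prefix for an LLM decoder."""

    def __init__(self, feat_dim: int = 256, llm_dim: int = 1024, num_regions: int = 8):
        super().__init__()
        self.global_proj = nn.Linear(feat_dim, llm_dim)
        self.region_proj = nn.Linear(feat_dim, llm_dim)
        # One learned embedding per referring region (masks must supply num_regions channels).
        self.region_embed = nn.Embedding(num_regions, llm_dim)

    def forward(self, feat_map: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, D, H, W) volume features; masks: (B, R, D, H, W) binary organ masks.
        flat_feat = feat_map.flatten(2)                                      # (B, C, N)
        flat_mask = masks.flatten(2).float()                                 # (B, R, N)
        # Mask-weighted average pooling -> one local feature per referring region.
        denom = flat_mask.sum(-1, keepdim=True).clamp(min=1.0)               # (B, R, 1)
        local = torch.einsum("brn,bcn->brc", flat_mask, flat_feat) / denom   # (B, R, C)
        global_feat = flat_feat.mean(-1)                                     # (B, C)
        region_tokens = self.region_proj(local) + self.region_embed.weight.unsqueeze(0)
        global_token = self.global_proj(global_feat).unsqueeze(1)            # (B, 1, llm_dim)
        # Concatenate global and region tokens as the visual prefix for the LLM decoder.
        return torch.cat([global_token, region_tokens], dim=1)               # (B, 1 + R, llm_dim)


# Usage: visual_prefix = RegionGuidedEncoder()(feat_map, masks), then prepend the
# prefix to the text embeddings consumed by a causal LLM during report generation.
```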
Related papers
- SGSeg: Enabling Text-free Inference in Language-guided Segmentation of Chest X-rays via Self-guidance [10.075820470715374]
We propose a self-guided segmentation framework (SGSeg) that leverages language guidance for training (multi-modal) while enabling text-free inference (uni-modal).
We exploit the critical location information of both pulmonary and pathological structures depicted in the text reports and introduce a novel localization-enhanced report generation (LERG) module to generate clinical reports for self-guidance.
Our LERG integrates an object detector and a location-based attention aggregator, weakly-supervised by a location-aware pseudo-label extraction module.
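As a loose illustration of what a location-based attention aggregator could look like (the class name, shapes, and interfaces below are assumptions for illustration, not the SGSeg code), detected region features can be weighted by attention scores derived from their box coordinates:

```python
# Hypothetical location-based attention aggregator (illustrative only).
import torch
import torch.nn as nn


class LocationAttentionAggregator(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.loc_embed = nn.Linear(4, feat_dim)   # embed normalized (x1, y1, x2, y2) boxes
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, region_feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, K, C) detector region features; boxes: (B, K, 4) normalized coords.
        loc_aware = region_feats + self.loc_embed(boxes)     # inject location information
        attn = torch.softmax(self.score(loc_aware), dim=1)   # (B, K, 1) attention over regions
        return (attn * loc_aware).sum(dim=1)                 # (B, C) aggregated descriptor
```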
arXiv Detail & Related papers (2024-09-07T08:16:00Z) - RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection [20.630629383286262]
Open-vocabulary object detection requires solid modeling of the region-semantic relationship.
We propose RTGen to generate scalable open-vocabulary region-text pairs.
arXiv Detail & Related papers (2024-05-30T09:03:23Z) - Multi-modality Regional Alignment Network for Covid X-Ray Survival Prediction and Report Generation [36.343753593390254]
This study proposes Multi-modality Regional Alignment Network (MRANet), an explainable model for radiology report generation and survival prediction.
MRANet visually grounds region-specific descriptions, providing robust anatomical regions with a completion strategy.
A cross-LLM alignment is employed to enhance the image-to-text transfer process, resulting in sentences rich in clinical detail and improved explainability for radiologists.
arXiv Detail & Related papers (2024-05-23T02:41:08Z) - HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction [16.060286162384536]
HistGen is a learning-empowered framework for histopathology report generation.
It aims to boost report generation by aligning whole slide images (WSIs) and diagnostic reports from local and global granularity.
Experimental results on WSI report generation show the proposed model outperforms state-of-the-art (SOTA) models by a large margin.
arXiv Detail & Related papers (2024-03-08T15:51:43Z) - RegionGPT: Towards Region Understanding Vision Language Model [88.42271128373191]
RegionGPT (RGPT for short) is a novel framework designed for complex region-level captioning and understanding.
We develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions.
We demonstrate that a universal RGPT model can be effectively applied to a range of region-level tasks, significantly enhancing performance.
arXiv Detail & Related papers (2024-03-04T18:58:08Z) - Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z) - Self adaptive global-local feature enhancement for radiology report generation [10.958641951927817]
We propose a novel framework, AGFNet, to dynamically fuse global and anatomical region features to generate multi-grained radiology reports.
First, we extract important anatomical region features and global features from the input chest X-ray (CXR).
Then, with the region features and the global features as input, our proposed self-adaptive fusion gate module dynamically fuses multi-granularity information.
Finally, the captioning generator produces the radiology reports from the multi-granularity features.
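A minimal sketch of a self-adaptive fusion gate in the spirit described above (names and dimensions are illustrative assumptions, not the AGFNet implementation): a learned gate decides, per channel and per sample, how much global versus region-level information to keep.

```python
# Hypothetical gated fusion of global and region features (illustrative only).
import torch
import torch.nn as nn


class AdaptiveFusionGate(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, global_feat: torch.Tensor, region_feat: torch.Tensor) -> torch.Tensor:
        # global_feat, region_feat: (B, dim); the gate is predicted from both inputs.
        g = self.gate(torch.cat([global_feat, region_feat], dim=-1))
        return g * global_feat + (1.0 - g) * region_feat   # convex combination per channel
```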
arXiv Detail & Related papers (2022-11-21T11:50:42Z) - Region-Based Semantic Factorization in GANs [67.90498535507106]
We present a highly efficient algorithm to factorize the latent semantics learned by Generative Adversarial Networks (GANs) concerning an arbitrary image region.
Through an appropriately defined generalized Rayleigh quotient, we solve such a problem without any annotations or training.
Experimental results on various state-of-the-art GAN models demonstrate the effectiveness of our approach.
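For reference, a generalized Rayleigh quotient takes the standard form below; the notation is generic, and the interpretation of the matrices is an assumption about how such a region-specific objective could be posed, not the paper's exact formulation.

```latex
% Generic generalized Rayleigh quotient; maximizing it reduces to a
% generalized eigenvalue problem, so no annotations or training are needed.
\[
  R(\mathbf{v}) \;=\; \frac{\mathbf{v}^{\top} A\,\mathbf{v}}{\mathbf{v}^{\top} B\,\mathbf{v}},
  \qquad
  A\,\mathbf{v} \;=\; \lambda\, B\,\mathbf{v}.
\]
% Here A and B would be positive semi-definite matrices summarizing the
% generator's sensitivity inside and outside the chosen image region
% (an assumption for illustration); the top generalized eigenvectors give
% candidate region-specific latent directions.
```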
arXiv Detail & Related papers (2022-02-19T17:46:02Z) - Global Aggregation then Local Distribution for Scene Parsing [99.1095068574454]
We show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing semantic segmentation networks.
Our approach allows us to build new state of the art on major semantic segmentation benchmarks including Cityscapes, ADE20K, Pascal Context, Camvid and COCO-stuff.
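As a rough sketch of the "aggregate globally, then distribute locally" idea as a plug-in block (the class name and the simplified pooling-based aggregator are illustrative assumptions, not the paper's exact design):

```python
# Hypothetical "aggregate globally, then redistribute locally" block (illustrative).
import torch
import torch.nn as nn


class GlobalAggregateLocalDistribute(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.aggregate = nn.AdaptiveAvgPool2d(1)          # simplified global aggregation
        self.distribute = nn.Sequential(                  # per-pixel gate for redistribution
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from any segmentation backbone.
        g = self.aggregate(x)        # (B, C, 1, 1) global context
        gate = self.distribute(x)    # (B, C, H, W) local gates
        return x + gate * g          # distribute global context where it is needed
```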
arXiv Detail & Related papers (2021-07-28T03:46:57Z) - Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation [107.3538598876467]
We propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns.
ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning.
arXiv Detail & Related papers (2020-06-06T01:00:15Z) - LRC-Net: Learning Discriminative Features on Point Clouds by Encoding Local Region Contexts [65.79931333193016]
We present a novel Local-Region-Context Network (LRC-Net) to learn discriminative features on point clouds.
LRC-Net encodes fine-grained contexts inside and among local regions simultaneously.
Results show LRC-Net is competitive with state-of-the-art methods in shape classification and shape segmentation applications.
arXiv Detail & Related papers (2020-03-18T14:34:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.