R2GenKG: Hierarchical Multi-modal Knowledge Graph for LLM-based Radiology Report Generation
- URL: http://arxiv.org/abs/2508.03426v1
- Date: Tue, 05 Aug 2025 13:13:45 GMT
- Title: R2GenKG: Hierarchical Multi-modal Knowledge Graph for LLM-based Radiology Report Generation
- Authors: Futian Wang, Yuhan Qiao, Xiao Wang, Fuling Wang, Yuxiang Zhang, Dengdi Sun
- Abstract summary: We first construct a large-scale multi-modal medical knowledge graph based on the ground-truth medical reports using GPT-4o. For the input X-ray image, we adopt a Swin Transformer to extract vision features, which interact with the knowledge via cross-attention. Finally, a large language model maps the semantic knowledge graph, input X-ray image, and disease-aware vision tokens into language descriptions.
- Score: 6.887661152518633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: X-ray medical report generation is one of the important applications of artificial intelligence in healthcare. With the support of large foundation models, the quality of medical report generation has significantly improved. However, challenges such as hallucination and weak disease diagnostic capability still persist. In this paper, we first construct a large-scale multi-modal medical knowledge graph (termed M3KG) based on the ground-truth medical reports using GPT-4o. It contains 2,477 entities, 3 kinds of relations, 37,424 triples, and 6,943 disease-aware vision tokens for the CheXpert Plus dataset. Then, we sample it to obtain multi-granularity semantic graphs and use an R-GCN encoder for feature extraction. For the input X-ray image, we adopt a Swin Transformer to extract vision features, which interact with the knowledge via cross-attention. The vision tokens are fed into a Q-Former, which retrieves the disease-aware vision tokens via another cross-attention. Finally, a large language model maps the semantic knowledge graph, input X-ray image, and disease-aware vision tokens into language descriptions. Extensive experiments on multiple datasets fully validate the effectiveness of our proposed knowledge graph and X-ray report generation framework. The source code of this paper will be released on https://github.com/Event-AHU/Medical_Image_Analysis.
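The pipeline described in the abstract (R-GCN-encoded graph features, Swin vision features, cross-attention fusion, Q-Former retrieval of disease-aware tokens, then an LLM decoder) can be pictured with the following minimal PyTorch sketch. It is illustrative only: the module names, dimensions, query count, and random placeholder inputs are assumptions, not the authors' released implementation.

```python
# Minimal sketch of knowledge-vision fusion followed by Q-Former-style retrieval.
# All shapes and hyperparameters are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class KnowledgeVisionFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8, num_queries=32):
        super().__init__()
        # Cross-attention: vision patches attend to knowledge-graph node features.
        self.kg_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable queries (Q-Former-style) that retrieve disease-aware vision tokens.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.qformer_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_feats, kg_feats):
        # vision_feats: (B, N_patches, dim), e.g. from a Swin-style backbone
        # kg_feats:     (B, N_nodes, dim),   e.g. from an R-GCN over the sampled graph
        fused, _ = self.kg_cross_attn(vision_feats, kg_feats, kg_feats)
        q = self.queries.expand(vision_feats.size(0), -1, -1)
        disease_tokens, _ = self.qformer_attn(q, fused, fused)
        return disease_tokens  # (B, num_queries, dim): soft prompts for the LLM decoder

# Toy usage with random tensors standing in for Swin / R-GCN outputs.
model = KnowledgeVisionFusion()
vision = torch.randn(2, 196, 768)  # placeholder patch features
kg = torch.randn(2, 50, 768)       # placeholder graph node features
print(model(vision, kg).shape)     # torch.Size([2, 32, 768])
```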
Related papers
- Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation [15.468023420115431]
We show how MLLMs may be enhanced to support Visual RAG, a retrieval-augmented generation framework. On the MIMIC-CXR chest X-ray report generation and Multicare medical image caption generation datasets, we show that Visual RAG improves the accuracy of entity probing.
arXiv Detail & Related papers (2025-02-20T20:55:34Z) - Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report Generation [54.631356899598956]
We propose a novel associative memory-enhanced X-ray report generation model that effectively mimics the process of professional doctors writing medical reports. We employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information (a minimal sketch of this Hopfield-style retrieval appears after this list).
arXiv Detail & Related papers (2025-01-07T01:19:48Z) - 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans.
Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z) - MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine [53.01393667775077]
This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine. It covers over 25 million images across 10 modalities with multigranular annotations for more than 65 diseases. Unlike existing multimodal datasets, which are limited by the availability of image-text pairs, we have developed the first automated pipeline.
arXiv Detail & Related papers (2024-08-06T02:09:35Z) - DeViDe: Faceted medical knowledge for improved medical vision-language pre-training [1.6567372257085946]
Vision-language pre-training for chest X-rays has made significant strides, primarily by utilizing paired radiographs and radiology reports.
We propose DeViDe, a transformer-based method that leverages radiographic descriptions from the open web.
DeViDe incorporates three key features for knowledge-augmented vision language alignment: First, a large-language model-based augmentation is employed to homogenise medical knowledge from diverse sources.
In zero-shot settings, DeViDe performs comparably to fully supervised models on external datasets and achieves state-of-the-art results on three large-scale datasets.
arXiv Detail & Related papers (2024-04-04T17:40:06Z) - XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [72.8965643836841]
We introduce XrayGPT, a novel conversational medical vision-language model. It can analyze and answer open-ended questions about chest radiographs. We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z) - Dynamic Graph Enhanced Contrastive Learning for Chest X-ray Report Generation [92.73584302508907]
We propose a knowledge graph with dynamic structure and nodes to facilitate medical report generation with contrastive learning.
In detail, the fundamental structure of our graph is pre-constructed from general knowledge.
Each image feature is integrated with its very own updated graph before being fed into the decoder module for report generation.
arXiv Detail & Related papers (2023-03-18T03:53:43Z) - AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation [50.21065317817769]
We propose an AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and the Multi-Grained Transformer (MGT) modules.
Experiments on the public IU-Xray and MIMIC-CXR datasets show that the AlignTransformer can achieve results competitive with state-of-the-art methods on the two datasets.
arXiv Detail & Related papers (2022-03-18T13:43:53Z) - XRayGAN: Consistency-preserving Generation of X-ray Images from Radiology Reports [19.360283053558604]
We develop methods to generate view-consistent, high-fidelity, and high-resolution X-ray images from radiology reports.
This work is the first to generate consistent, high-resolution X-ray images from radiology reports.
arXiv Detail & Related papers (2020-06-17T05:32:14Z)
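For the associative memory-enhanced model referenced above, the following is a minimal sketch of the modern (continuous) Hopfield retrieval step that such visual/report memory networks are commonly built on: a query pattern is replaced by a softmax-weighted combination of stored patterns. The dimensions, temperature beta, and random memory bank are illustrative assumptions, not values from that paper.

```python
# Minimal sketch of a modern Hopfield retrieval step (softmax attention over a memory bank).
# All shapes, beta, and the random memory are illustrative assumptions.
import torch

def hopfield_retrieve(query: torch.Tensor, memory: torch.Tensor, beta: float = 8.0) -> torch.Tensor:
    # query:  (B, dim)        current state patterns (e.g., disease-related tokens)
    # memory: (num_mem, dim)  stored patterns (e.g., visual or report memory slots)
    attn = torch.softmax(beta * query @ memory.T, dim=-1)  # (B, num_mem) retrieval weights
    return attn @ memory                                    # (B, dim) retrieved patterns

memory = torch.randn(64, 256)  # placeholder memory bank
query = torch.randn(2, 256)    # placeholder query tokens
print(hopfield_retrieve(query, memory).shape)  # torch.Size([2, 256])
```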