3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks
- URL: http://arxiv.org/abs/2506.11147v1
- Date: Wed, 11 Jun 2025 09:55:42 GMT
- Title: 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks
- Authors: Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, Zuozhu Liu
- Abstract summary: Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans.
- Score: 14.366478737339909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs, exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set, 3D-RAD-T, of 136,195 expert-aligned samples, and show that fine-tuning on this dataset can significantly enhance model performance. Our dataset and code are publicly available at https://github.com/Tang-xiaoxiao/M3D-RAD, aiming to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding.
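To make the task structure described in the abstract concrete, a single 3D-RAD sample might be represented along the following lines. This is a minimal sketch: the field names and the accuracy helper are illustrative assumptions, not the released format (see the GitHub repository for the authoritative schema).

```python
# Hypothetical schema for one 3D-RAD VQA sample; field names are
# illustrative assumptions, not the released format.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RadVQASample:
    ct_paths: List[str]           # one CT volume, or several for longitudinal tasks
    task: str                     # e.g. "anomaly_detection", "medical_computation"
    question: str
    question_type: str            # "open" or "closed"
    options: Optional[List[str]]  # present only for closed-ended questions
    answer: str

sample = RadVQASample(
    ct_paths=["scans/patient_012/2021.nii.gz", "scans/patient_012/2023.nii.gz"],
    task="longitudinal_temporal_diagnosis",
    question="Has the pleural effusion progressed between the two studies?",
    question_type="closed",
    options=["yes", "no"],
    answer="yes",
)

def closed_ended_accuracy(preds: List[str], samples: List[RadVQASample]) -> float:
    """Exact-match accuracy over closed-ended questions, a common Med-VQA metric."""
    pairs = [(p, s) for p, s in zip(preds, samples) if s.question_type == "closed"]
    return sum(p.strip().lower() == s.answer.lower() for p, s in pairs) / len(pairs)

print(closed_ended_accuracy(["yes"], [sample]))  # 1.0
```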
Related papers
- Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering [8.185551155349241]
Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear. We present DeepTumorVQA, a diagnostic visual question answering benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning.
arXiv Detail & Related papers (2025-05-25T00:50:15Z)
- Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection [53.2590751089607]
Real-IAD D3 is a high-precision multimodal dataset that incorporates an additional pseudo-3D modality generated through photometric stereo. We introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance.
arXiv Detail & Related papers (2025-04-19T08:05:47Z)
- Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering [28.717312557697376]
3D Scene Question Answering (3D SQA) represents an interdisciplinary task that integrates 3D visual perception and natural language processing. Recent advances in large multimodal models have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. This paper presents the first comprehensive survey of 3D SQA, systematically reviewing datasets, methodologies, and evaluation metrics.
arXiv Detail & Related papers (2025-02-01T07:01:33Z)
- MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training [7.968487067774351]
3D medical image analysis is pivotal in numerous clinical applications. However, large-scale vision-language pre-training remains underexplored in this domain. We propose MG-3D, pre-trained on large-scale data (47.1K).
arXiv Detail & Related papers (2024-12-08T09:45:59Z)
- E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model [23.56751925900571]
The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment.
We utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features.
We apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity (a sketch of this pattern follows this entry).
Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis.
arXiv Detail & Related papers (2024-10-18T06:31:40Z)
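The 3D spatial convolution aggregation mentioned in the E3D-GPT summary above can be illustrated with a minimal PyTorch sketch. The layer sizes, names, and stride-2 pooling choice are assumptions for illustration, not E3D-GPT's actual configuration:

```python
# Sketch: aggregate high-level 3D features with a strided 3D convolution,
# then project them into a language model's embedding space. All sizes
# here are illustrative assumptions.
import torch
import torch.nn as nn

class Conv3DAggregator(nn.Module):
    def __init__(self, in_ch: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Stride-2 convolution halves each spatial axis, cutting the
        # number of visual tokens by roughly 8x.
        self.pool = nn.Conv3d(in_ch, in_ch, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Linear(in_ch, llm_dim)  # map to the LLM embedding size

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, D, H, W) features from a 3D vision encoder
        x = self.pool(feats)              # (B, C, D/2, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)  # (B, D*H*W/8, C) token sequence
        return self.proj(x)               # (B, tokens, llm_dim)

tokens = Conv3DAggregator()(torch.randn(1, 768, 8, 16, 16))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```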
- 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans.
Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
- Super-resolution of biomedical volumes with 2D supervision [84.5255884646906]
Masked slice diffusion for super-resolution exploits the inherent equivalence in the data-generating distribution across all spatial dimensions of biological specimens.
We focus on the application of SliceR to stimulated Raman histology (SRH), characterized by its rapid acquisition of high-resolution 2D images but slow and costly optical z-sectioning.
arXiv Detail & Related papers (2024-04-15T02:41:55Z)
- M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [49.5030774873328]
Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information.
We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs.
We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
arXiv Detail & Related papers (2024-03-31T06:55:12Z)
- Generative Enhancement for 3D Medical Images [74.17066529847546]
We propose GEM-3D, a novel generative approach to the synthesis of 3D medical images.
Our method begins with a 2D slice, termed the informed slice, which serves as the patient prior, and propagates the generation process using a 3D segmentation mask.
By decomposing the 3D medical images into masks and patient prior information, GEM-3D offers a flexible yet effective solution for generating versatile 3D images.
arXiv Detail & Related papers (2024-03-19T15:57:04Z)
- Large-scale Long-tailed Disease Diagnosis on Radiology Images [51.453990034460304]
RadDiag is a foundation model supporting 2D and 3D inputs across various modalities and anatomies.
Our dataset, RP3D-DiagDS, contains 40,936 cases with 195,010 scans covering 5,568 disorders.
arXiv Detail & Related papers (2023-12-26T18:20:48Z)
- 3D-MIR: A Benchmark and Empirical Study on 3D Medical Image Retrieval in Radiology [6.851500027718433]
The field of 3D medical image retrieval is still emerging, lacking established evaluation benchmarks, comprehensive datasets, and thorough studies.
This paper introduces a novel benchmark for 3D Medical Image Retrieval (3D-MIR) that encompasses four different anatomies imaged with computed tomography.
Using this benchmark, we explore a diverse set of search strategies that use aggregated 2D slices, 3D volumes, and multi-modal embeddings from popular multi-modal foundation models as queries (a sketch of slice aggregation follows this entry).
arXiv Detail & Related papers (2023-11-23T00:57:35Z)
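One of the query strategies listed in the 3D-MIR summary above, aggregating 2D slice embeddings into a single volume-level query, could look roughly as follows. The encoder here is a stand-in placeholder, not one of the benchmark's actual foundation models:

```python
# Sketch of a slice-aggregation query for 3D retrieval: embed each 2D
# slice, mean-pool into one volume-level vector, and rank an index by
# cosine similarity. `encode_slice` is a placeholder for a real 2D
# image embedding model (e.g. a CLIP-style encoder).
import numpy as np

def encode_slice(slice_2d: np.ndarray) -> np.ndarray:
    """Placeholder encoder: deterministic random features per slice."""
    rng = np.random.default_rng(abs(hash(slice_2d.tobytes())) % (2**32))
    return rng.standard_normal(512)

def volume_embedding(volume: np.ndarray) -> np.ndarray:
    # volume: (depth, H, W); mean-pool per-slice embeddings, then
    # L2-normalize so dot products equal cosine similarity.
    pooled = np.stack([encode_slice(s) for s in volume]).mean(axis=0)
    return pooled / np.linalg.norm(pooled)

index = np.stack([volume_embedding(np.random.rand(16, 32, 32)) for _ in range(4)])
query = volume_embedding(np.random.rand(16, 32, 32))
print(np.argsort(index @ query)[::-1])  # volume indices ranked by similarity
```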
- Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data [66.9359934608229]
This study aims to initiate the development of a Radiology Foundation Model, termed RadFM.
To the best of our knowledge, this is the first large-scale, high-quality medical visual-language dataset with both 2D and 3D scans.
We propose a new evaluation benchmark, RadBench, that comprises five tasks, including modality recognition, disease diagnosis, visual question answering, report generation and rationale diagnosis.
arXiv Detail & Related papers (2023-08-04T17:00:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.