Adapting Lightweight Vision Language Models for Radiological Visual Question Answering
- URL: http://arxiv.org/abs/2506.14451v1
- Date: Tue, 17 Jun 2025 12:15:08 GMT
- Title: Adapting Lightweight Vision Language Models for Radiological Visual Question Answering
- Authors: Aditya Shourya, Michel Dumontier, Chang Sun
- Abstract summary: In this study, we fine-tune a lightweight 3B parameter vision-language model for Radiological VQA. We show that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We introduce a lightweight saliency-based diagnostic tool that enables domain experts to inspect VQA model performance and identify ill-conditioned failure modes.
- Score: 1.0104586293349587
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in vision-language systems have improved the accuracy of Radiological Visual Question Answering (VQA) models. However, challenges remain at each stage of model development: the limited supply of expert-labeled images hinders data procurement at scale; the intricate and nuanced patterns of radiological images make modeling inherently difficult; and the lack of evaluation efforts makes it difficult to identify cases where the model might be ill-conditioned. In this study, we fine-tune a lightweight 3B parameter vision-language model for Radiological VQA, demonstrating that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We propose a cost-effective training pipeline, from synthetic question-answer pair generation to multi-stage fine-tuning on specialised radiological domain-targeted datasets (e.g., ROCO v2.0, MedPix v2.0). Our results show that despite operating at a fraction of the scale of state-of-the-art models such as LLaVA-Med, our model achieves promising performance given its small parameter size and the limited scale of training data. We introduce a lightweight saliency-based diagnostic tool that enables domain experts to inspect VQA model performance and identify ill-conditioned failure modes through saliency analysis.
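The abstract does not specify how the saliency-based diagnostic tool is implemented; as an illustration only, a common model-agnostic approach is occlusion saliency, which masks image regions and records the drop in the model's answer score. The `toy_score` function below is a stand-in, not the paper's VQA model:

```python
import numpy as np

def occlusion_saliency(model_score, image, patch=4):
    """Occlusion-based saliency: zero out each patch and record the
    drop in the model's score. A larger drop marks a more salient region."""
    h, w = image.shape
    baseline = model_score(image)
    saliency = np.zeros_like(image, dtype=float)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0
            saliency[i:i + patch, j:j + patch] = baseline - model_score(occluded)
    return saliency

# Toy "model": the score is the mean intensity of the top-left quadrant,
# so saliency should concentrate there.
def toy_score(img):
    return float(img[:8, :8].mean())

img = np.ones((16, 16))
sal = occlusion_saliency(toy_score, img)
```

A domain expert can then overlay `sal` on the radiograph to check whether the model attends to clinically plausible regions, which is the kind of failure-mode inspection the abstract describes.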
Related papers
- Comparative Evaluation of Radiomics and Deep Learning Models for Disease Detection in Chest Radiography [0.0]
We evaluate radiomics-based and deep learning-based approaches for disease detection in chest radiography. Deep learning models learn directly from image data, while radiomics-based models extract handcrafted features. These findings provide statistically validated, data-driven recommendations for model selection in diagnostic AI.
arXiv Detail & Related papers (2025-04-16T16:54:37Z) - Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling [10.014130930114172]
We introduce a conditional generative AI model designed for virtual clinical trials (VCTs) of radiology AI. By learning the joint distribution of images and anatomical structures, our model enables precise replication of real-world patient populations. We demonstrate meaningful evaluation of radiology AI models through VCTs powered by our synthetic CT study populations.
arXiv Detail & Related papers (2025-02-13T15:53:52Z) - Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis [55.959002385347645]
Latent Drifting enables diffusion models to be conditioned on medical images, suited to the complex task of counterfactual image generation. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation.
arXiv Detail & Related papers (2024-12-30T01:59:34Z) - Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study evaluated the performance of Gemini, GPT-4, and four other popular large models in an exhaustive evaluation across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z) - The Relevance Feature and Vector Machine for health applications [0.11538034264098687]
This paper presents a novel model that addresses the challenges of the fat-data problem when dealing with clinical prospective studies.
The model capabilities are tested against state-of-the-art models in several medical datasets with fat-data problems.
arXiv Detail & Related papers (2024-02-11T01:21:56Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z) - Hierarchical Analysis of Visual COVID-19 Features from Chest Radiographs [5.832030105874915]
We model radiological features with a human-interpretable class hierarchy that aligns with the radiological decision process.
Experiments show that model failures highly correlate with ICU imaging conditions and with the inherent difficulty in distinguishing certain types of radiological features.
arXiv Detail & Related papers (2021-07-14T11:37:28Z) - Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
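The abstract does not detail the KNNS method; as a hedged sketch of the general idea, neighbor-based smoothing averages each sample's predicted probabilities with those of its nearest neighbors in feature space, damping outlier predictions:

```python
import numpy as np

def knn_smooth(features, probs, k=3):
    """Smooth each sample's predicted class probabilities by averaging
    them with those of its k nearest neighbours in feature space
    (the neighbourhood includes the sample itself)."""
    d = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    idx = d.argsort(axis=1)[:, :k]
    return probs[idx].mean(axis=1)

# An isolated disagreeing prediction (sample 2) is pulled toward
# the predictions of its close neighbours.
feats = np.array([[0.0], [0.1], [0.2], [5.0]])
probs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
sm = knn_smooth(feats, probs, k=3)
```

This is an illustrative simplification; the paper's actual KNNS may differ in how neighbors and weights are chosen.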
arXiv Detail & Related papers (2021-02-26T02:29:30Z) - Single Model Deep Learning on Imbalanced Small Datasets for Skin Lesion Classification [5.642359877598896]
This paper proposes a novel data augmentation strategy for single model classification of skin lesions based on a small and imbalanced dataset.
Various DCNNs are trained on this dataset to show that the models with moderate complexity outperform the larger models.
By combining Modified RandAugment and Multi-weighted Focal Loss in a single DCNN model, we achieve classification accuracy comparable to that of multiple ensembled models on the ISIC 2018 challenge test dataset.
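The exact form of the paper's Multi-weighted Focal Loss is not given here; as an illustration, a standard multi-class focal loss with optional per-class weights (one plausible reading of "multi-weighted") looks like this:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, weights=None):
    """Focal loss: down-weights well-classified examples by
    (1 - p_t)^gamma, focusing training on hard samples.
    `p` is an (n, classes) array of predicted probabilities,
    `y` the true class indices, `weights` an optional per-class
    weight vector to counter class imbalance."""
    p_t = p[np.arange(len(y)), y]            # probability of the true class
    w = 1.0 if weights is None else np.asarray(weights)[y]
    return float(np.mean(-w * (1.0 - p_t) ** gamma * np.log(p_t)))

p = np.array([[0.9, 0.1],    # confident, correct prediction
              [0.6, 0.4]])   # uncertain prediction
y = np.array([0, 0])
conf_loss = focal_loss(p[:1], y[:1])
unconf_loss = focal_loss(p[1:], y[1:])
```

The confident prediction contributes far less loss than the uncertain one, which is what makes focal loss attractive for small, imbalanced datasets like the one this paper targets.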
arXiv Detail & Related papers (2021-02-02T03:48:55Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
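The Prototypical Network's core rule (Snell et al.) is simple: each class is represented by the mean embedding of its support examples, and queries are assigned to the nearest prototype. A minimal sketch, using toy 2-D embeddings in place of a learned encoder:

```python
import numpy as np

def prototypes(support, labels):
    """Compute one prototype per class: the mean of its support embeddings."""
    classes = np.unique(labels)
    protos = np.stack([support[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query, classes, protos):
    """Assign each query embedding to the nearest prototype (Euclidean)."""
    d = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

# Two classes with two support examples each (stand-in embeddings).
support = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 1, 1])
classes, protos = prototypes(support, labels)
pred = classify(np.array([[0.1, 0.1], [5.1, 4.9]]), classes, protos)
```

Select-ProtoNet builds additional task-selection machinery on top of this base learner; the sketch above covers only the underlying prototype classification step.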
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Sampling for Deep Learning Model Diagnosis (Technical Report) [5.8057675678464555]
The black-box nature of deep neural networks is a barrier to adoption in applications such as medical diagnosis.
We develop a novel data sampling technique that produces approximate but accurate results for model debugging queries.
We evaluate our techniques on one standard computer vision and one scientific data set and demonstrate that our sampling technique outperforms a variety of state-of-the-art alternatives in terms of query accuracy.
arXiv Detail & Related papers (2020-02-22T19:24:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.