Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation
- URL: http://arxiv.org/abs/2506.11105v3
- Date: Thu, 07 Aug 2025 14:57:45 GMT
- Title: Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation
- Authors: Uttej Kallakurik, Edward Humes, Rithvik Jonna, Xiaomin Lin, Tinoosh Mohsenin
- Abstract summary: Large Language Models (LLMs) have a significant impact on healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. We introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors LLMs for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance.
- Score: 1.2338220374261344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have a significant impact on healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors LLMs for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and evaluate the compressed model across medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50% compressed Gemma and the 67% compressed LLaMA3 models on Jetson Orin Nano (18.7W peak) and Raspberry Pi 5 (6.3W peak), achieving real-time, energy-efficient inference under hardware constraints.
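The abstract does not spell out the saliency criterion. A minimal sketch of input-driven neuron pruning for one pair of linear layers, assuming a first-order (activation times gradient) score averaged over a domain-specific calibration batch, might look like the following; the function names are illustrative, not the paper's API:

```python
import torch
import torch.nn as nn

def neuron_saliency(acts: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    # First-order Taylor saliency per output neuron: |activation * gradient|,
    # averaged over a domain-specific calibration batch (an assumption; the
    # paper's exact criterion is not given in the abstract).
    return (acts * grads).abs().mean(dim=0)

def prune_pair(layer: nn.Linear, next_layer: nn.Linear, saliency: torch.Tensor,
               keep_ratio: float) -> tuple[nn.Linear, nn.Linear]:
    # Keep the top-k most salient output neurons, shrinking this layer's
    # rows and the next layer's input columns to match.
    k = max(1, int(keep_ratio * layer.out_features))
    keep = saliency.topk(k).indices.sort().values
    pruned = nn.Linear(layer.in_features, k, bias=layer.bias is not None)
    pruned.weight.data = layer.weight.data[keep].clone()
    if layer.bias is not None:
        pruned.bias.data = layer.bias.data[keep].clone()
    nxt = nn.Linear(k, next_layer.out_features, bias=next_layer.bias is not None)
    nxt.weight.data = next_layer.weight.data[:, keep].clone()
    if next_layer.bias is not None:
        nxt.bias.data = next_layer.bias.data.clone()
    return pruned, nxt
```

Post-training quantization would then be applied to the pruned checkpoint before deployment; the bit widths used are not stated in this abstract.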
Related papers
- MedGemma Technical Report [75.88152277443179]
We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
arXiv Detail & Related papers (2025-07-07T17:01:44Z)
- Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of the most capable language models. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z)
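The summary mentions "dynamic sparse model structures" without detail. For context, a minimal top-k Mixture-of-Experts layer, generic rather than Pangu-specific, can be sketched as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    # Each token is routed to k of n_experts expert MLPs; only those experts
    # run, which is the "dynamic sparse" structure the summary refers to.
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)
        topv, topi = probs.topk(self.k, dim=-1)       # k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # simple loop for clarity
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Production MoE training adds load-balancing losses and expert-parallel communication on top of this, which is where hardware-specific tuning comes in.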
- Pathology Image Compression with Pre-trained Autoencoders [52.208181380986524]
Whole Slide Images in digital histopathology pose significant storage, transmission, and computational efficiency challenges. Standard compression methods, such as JPEG, reduce file sizes but fail to preserve fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for Latent Diffusion Models as an efficient learned compression framework for pathology images.
arXiv Detail & Related papers (2025-03-14T17:01:17Z)
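As a sketch of the approach, the publicly available Stable Diffusion VAE from the diffusers library can stand in for a pre-trained Latent Diffusion autoencoder; the checkpoint choice is an assumption, not the paper's:

```python
import torch
from diffusers import AutoencoderKL

# Stand-in checkpoint; the paper's actual pre-trained AE may differ.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def compress(patch: torch.Tensor) -> torch.Tensor:
    # patch: (1, 3, H, W) scaled to [-1, 1]; the latent has 8x smaller
    # spatial dims and 4 channels, i.e. ~48x fewer elements than the input.
    return vae.encode(patch).latent_dist.mode()

@torch.no_grad()
def decompress(latent: torch.Tensor) -> torch.Tensor:
    return vae.decode(latent).sample
```

A full codec would additionally entropy-code the latents; whether the paper does so is not stated in this summary.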
- QuantU-Net: Efficient Wearable Medical Imaging Using Bitwidth as a Trainable Parameter [0.0]
We introduce QuantU-Net, a quantized version of U-Net optimized for efficient deployment on low-power devices. The model achieves an approximately 8x reduction in size, making it suitable for real-time applications in wearable medical devices.
arXiv Detail & Related papers (2025-03-10T16:25:34Z)
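The core idea named in the title is making bitwidth differentiable. A minimal fake-quantization module with a learnable bitwidth and a straight-through estimator, a sketch of the general technique rather than QuantU-Net's exact formulation:

```python
import torch
import torch.nn as nn

class LearnableBitQuant(nn.Module):
    # Fake-quantizes a tensor with a continuous, trainable bitwidth.
    # Illustrative only; QuantU-Net's exact parameterization is not
    # given in this summary.
    def __init__(self, init_bits: float = 8.0):
        super().__init__()
        self.bits = nn.Parameter(torch.tensor(init_bits))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bits = self.bits.clamp(2.0, 8.0)
        levels = 2.0 ** bits - 1.0               # differentiable in bits
        scale = x.detach().abs().max().clamp(min=1e-8)
        y = (x / scale).clamp(-1, 1) * levels
        y_q = (torch.round(y) - y).detach() + y  # straight-through round()
        return (y_q / levels) * scale
```

Adding a penalty proportional to `bits` to the task loss lets training trade accuracy against model size.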
- Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices [7.229732269884237]
This paper proposes small and efficient machine learning models (TinyML) for resource-constrained edge devices. The work focuses on model compression techniques, including quantization and knowledge distillation, to significantly reduce the model size. The application of these TinyML models in healthcare has the potential to revolutionize patient monitoring.
arXiv Detail & Related papers (2024-12-12T13:59:21Z)
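Of the two techniques the summary names, knowledge distillation has a standard starting point: the temperature-scaled loss below. The temperature and mixing weight are illustrative defaults, not the paper's settings:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.5):
    # Blend hard-label cross-entropy with soft-label KL at temperature T.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 restores the gradient scale after softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```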
- Democratizing MLLMs in Healthcare: TinyLLaVA-Med for Efficient Healthcare Diagnostics in Resource-Constrained Settings [7.227964619923918]
We introduce an optimization method for the general-purpose MLLM, TinyLLaVA, which we have adapted and renamed TinyLLaVA-Med.
This adaptation involves instruction-tuning and fine-tuning TinyLLaVA on a medical dataset by drawing inspiration from the LLaVA-Med training pipeline.
Our approach successfully minimizes computational complexity and power consumption, with TinyLLaVA-Med operating at 18.9W and using 11.9GB of memory, while achieving accuracies of 64.54% on VQA-RAD and 70.70% on SLAKE.
arXiv Detail & Related papers (2024-09-02T21:14:16Z)
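Instruction-tuning itself follows the usual supervised recipe; a minimal sketch of turning one medical Q&A pair into a training example with the prompt masked out of the loss (the chat format and helper are illustrative, not the exact LLaVA-Med pipeline):

```python
import torch

def build_example(tokenizer, question: str, answer: str):
    # Supervise only the answer tokens; -100 is ignored by the LM loss.
    prompt_ids = tokenizer(f"USER: {question}\nASSISTANT: ").input_ids
    answer_ids = tokenizer(answer + tokenizer.eos_token,
                           add_special_tokens=False).input_ids
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = -100
    return input_ids, labels

# One Hugging Face-style training step:
# loss = model(input_ids=ids[None], labels=labels[None]).loss
```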
- HRSAM: Efficient Interactive Segmentation in High-Resolution Images [59.537068118473066]
Segment Anything Model (SAM) has advanced interactive segmentation but is limited by the high computational cost on high-resolution images.
We focus on visual length extrapolation and propose a lightweight model named HRSAM.
The extrapolation enables HRSAM trained on low resolutions to generalize to high resolutions.
arXiv Detail & Related papers (2024-07-02T09:51:56Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
The inference of LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
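CheXprompt's prompt and rubric are not reproduced in this summary; an illustrative GPT-4-as-judge call for counting factual errors in a generated report (the function and prompt text are assumptions) might look like:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def factuality_review(reference: str, candidate: str) -> str:
    # Illustrative judge prompt only; not CheXprompt's actual rubric.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Compare the candidate radiology report against the reference. "
                "List and count the clinically significant factual errors.\n\n"
                f"Reference:\n{reference}\n\nCandidate:\n{candidate}"
            ),
        }],
    )
    return resp.choices[0].message.content
```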
- MedAide: Leveraging Large Language Models for On-Premise Medical Assistance on Edge Devices [7.042194397224198]
Large language models (LLMs) are revolutionizing various domains with their remarkable natural language processing (NLP) abilities.
However, deploying LLMs in resource-constrained edge computing and embedded systems presents significant challenges.
These challenges include delivering medical assistance in remote areas with limited healthcare facilities and infrastructure.
arXiv Detail & Related papers (2024-02-28T08:30:49Z)
- PEFT-MedAware: Large Language Model for Medical Awareness [0.0]
We propose a specialized PEFT-MedAware model to enhance the Falcon-1b large language model on the MedQuAD medical question-answering data.
The model outperformed other LLMs on medical question-answering tasks in specific domains.
We propose further improvements through expanded datasets, larger models, and feedback mechanisms for sustained medical relevance.
arXiv Detail & Related papers (2023-11-17T18:32:17Z)
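The summary does not say which PEFT method was used; a minimal LoRA setup with the Hugging Face peft library on a Falcon-1B-class checkpoint (checkpoint name and hyperparameters are illustrative assumptions) would be:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Checkpoint and LoRA hyperparameters are assumptions, not the paper's.
base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-rw-1b")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of weights train
```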
- EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z)
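The abstract describes the epitome only as "a lightweight neural operator offering convolution-like functionality". One speculative reading, sketched below purely as an assumption, stores a single small parameter tensor and crops overlapping windows from it to form the convolution kernels, so many filters share far fewer stored weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpitomeConv2d(nn.Module):
    # Kernels are overlapping crops of one small "epitome" tensor.
    # A speculative sketch of the idea, not EPIM's exact operator.
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, epi: int = 8):
        super().__init__()
        self.k = k
        self.epitome = nn.Parameter(torch.randn(in_ch, epi, epi) * 0.02)
        # One fixed crop offset per output filter (an assumption).
        self.register_buffer("offsets", torch.randint(0, epi - k + 1, (out_ch, 2)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        crops = [self.epitome[:, r:r + self.k, c:c + self.k]
                 for r, c in self.offsets.tolist()]
        weight = torch.stack(crops)          # (out_ch, in_ch, k, k)
        return F.conv2d(x, weight, padding=self.k // 2)
```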