Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation
- URL: http://arxiv.org/abs/2506.11105v3
- Date: Thu, 07 Aug 2025 14:57:45 GMT
- Title: Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation
- Authors: Uttej Kallakurik, Edward Humes, Rithvik Jonna, Xiaomin Lin, Tinoosh Mohsenin
- Abstract summary: Large Language Models (LLMs) have a significant impact on healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. We introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors LLMs for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance.
- Score: 1.2338220374261344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have a significant impact on healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors LLMs for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and evaluate the compressed model across medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50% compressed Gemma and the 67% compressed LLaMA3 models on Jetson Orin Nano (18.7W peak) and Raspberry Pi 5 (6.3W peak), achieving real-time, energy-efficient inference under hardware constraints.
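The abstract does not spell out the saliency criterion. A minimal sketch of input-driven neuron pruning for one pair of linear layers, assuming a first-order (activation times gradient) score averaged over a domain-specific calibration batch, might look like the following; the function names are illustrative, not the paper's API:

```python
import torch
import torch.nn as nn

def neuron_saliency(acts: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    # First-order Taylor saliency per output neuron: |activation * gradient|,
    # averaged over a domain-specific calibration batch (an assumption; the
    # paper's exact criterion is not given in the abstract).
    return (acts * grads).abs().mean(dim=0)

def prune_pair(layer: nn.Linear, next_layer: nn.Linear, saliency: torch.Tensor,
               keep_ratio: float) -> tuple[nn.Linear, nn.Linear]:
    # Keep the top-k most salient output neurons, shrinking this layer's
    # rows and the next layer's input columns to match.
    k = max(1, int(keep_ratio * layer.out_features))
    keep = saliency.topk(k).indices.sort().values
    pruned = nn.Linear(layer.in_features, k, bias=layer.bias is not None)
    pruned.weight.data = layer.weight.data[keep].clone()
    if layer.bias is not None:
        pruned.bias.data = layer.bias.data[keep].clone()
    nxt = nn.Linear(k, next_layer.out_features, bias=next_layer.bias is not None)
    nxt.weight.data = next_layer.weight.data[:, keep].clone()
    if next_layer.bias is not None:
        nxt.bias.data = next_layer.bias.data.clone()
    return pruned, nxt
```

Post-training quantization would then be applied to the pruned checkpoint before deployment; the bit widths used are not stated in this abstract.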
Related papers
- MedGemma Technical Report [75.88152277443179]
We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
arXiv Detail & Related papers (2025-07-07T17:01:44Z)
- Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of the most capable language models. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z)
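The summary mentions "dynamic sparse model structures" without detail. For context, a minimal top-k Mixture-of-Experts layer, generic rather than Pangu-specific, can be sketched as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    # Each token is routed to k of n_experts expert MLPs; only those experts
    # run, which is the "dynamic sparse" structure the summary refers to.
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)
        topv, topi = probs.topk(self.k, dim=-1)       # k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # simple loop for clarity
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Production MoE training adds load-balancing losses and expert-parallel communication on top of this, which is where hardware-specific tuning comes in.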
- Pathology Image Compression with Pre-trained Autoencoders [52.208181380986524]
Whole Slide Images in digital histopathology pose significant storage, transmission, and computational efficiency challenges. Standard compression methods, such as JPEG, reduce file sizes but fail to preserve fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for Latent Diffusion Models as an efficient learned compression framework for pathology images.
arXiv Detail & Related papers (2025-03-14T17:01:17Z)
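As a sketch of the approach, the publicly available Stable Diffusion VAE from the diffusers library can stand in for a pre-trained Latent Diffusion autoencoder; the checkpoint choice is an assumption, not the paper's:

```python
import torch
from diffusers import AutoencoderKL

# Stand-in checkpoint; the paper's actual pre-trained AE may differ.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def compress(patch: torch.Tensor) -> torch.Tensor:
    # patch: (1, 3, H, W) scaled to [-1, 1]; the latent has 8x smaller
    # spatial dims and 4 channels, i.e. ~48x fewer elements than the input.
    return vae.encode(patch).latent_dist.mode()

@torch.no_grad()
def decompress(latent: torch.Tensor) -> torch.Tensor:
    return vae.decode(latent).sample
```

A full codec would additionally entropy-code the latents; whether the paper does so is not stated in this summary.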
- QuantU-Net: Efficient Wearable Medical Imaging Using Bitwidth as a Trainable Parameter [0.0]
We introduce QuantU-Net, a quantized version of U-Net optimized for efficient deployment on low-power devices. The model achieves an approximately 8x reduction in size, making it suitable for real-time applications in wearable medical devices.
arXiv Detail & Related papers (2025-03-10T16:25:34Z)
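The core idea named in the title is making bitwidth differentiable. A minimal fake-quantization module with a learnable bitwidth and a straight-through estimator, a sketch of the general technique rather than QuantU-Net's exact formulation:

```python
import torch
import torch.nn as nn

class LearnableBitQuant(nn.Module):
    # Fake-quantizes a tensor with a continuous, trainable bitwidth.
    # Illustrative only; QuantU-Net's exact parameterization is not
    # given in this summary.
    def __init__(self, init_bits: float = 8.0):
        super().__init__()
        self.bits = nn.Parameter(torch.tensor(init_bits))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bits = self.bits.clamp(2.0, 8.0)
        levels = 2.0 ** bits - 1.0               # differentiable in bits
        scale = x.detach().abs().max().clamp(min=1e-8)
        y = (x / scale).clamp(-1, 1) * levels
        y_q = (torch.round(y) - y).detach() + y  # straight-through round()
        return (y_q / levels) * scale
```

Adding a penalty proportional to `bits` to the task loss lets training trade accuracy against model size.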
- Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices [7.229732269884237]
This paper proposes small and efficient machine learning models (TinyML) for resource-constrained edge devices. The work focuses on model compression techniques, including quantization and knowledge distillation, to significantly reduce the model size. The application of these TinyML models in healthcare has the potential to revolutionize patient monitoring.
arXiv Detail & Related papers (2024-12-12T13:59:21Z)
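Of the two techniques the summary names, knowledge distillation has a standard starting point: the temperature-scaled loss below. The temperature and mixing weight are illustrative defaults, not the paper's settings:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.5):
    # Blend hard-label cross-entropy with soft-label KL at temperature T.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 restores the gradient scale after softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```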
- Democratizing MLLMs in Healthcare: TinyLLaVA-Med for Efficient Healthcare Diagnostics in Resource-Constrained Settings [7.227964619923918]
We introduce an optimization method for the general-purpose MLLM, TinyLLaVA, which we have adapted and renamed TinyLLaVA-Med.
This adaptation involves instruction-tuning and fine-tuning TinyLLaVA on a medical dataset by drawing inspiration from the LLaVA-Med training pipeline.
Our approach successfully minimizes computational complexity and power consumption, with TinyLLaVA-Med operating at 18.9W and using 11.9GB of memory, while achieving accuracies of 64.54% on VQA-RAD and 70.70% on SLAKE.
arXiv Detail & Related papers (2024-09-02T21:14:16Z)
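Instruction-tuning itself follows the usual supervised recipe; a minimal sketch of turning one medical Q&A pair into a training example with the prompt masked out of the loss (the chat format and helper are illustrative, not the exact LLaVA-Med pipeline):

```python
import torch

def build_example(tokenizer, question: str, answer: str):
    # Supervise only the answer tokens; -100 is ignored by the LM loss.
    prompt_ids = tokenizer(f"USER: {question}\nASSISTANT: ").input_ids
    answer_ids = tokenizer(answer + tokenizer.eos_token,
                           add_special_tokens=False).input_ids
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = -100
    return input_ids, labels

# One Hugging Face-style training step:
# loss = model(input_ids=ids[None], labels=labels[None]).loss
```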
- HRSAM: Efficient Interactive Segmentation in High-Resolution Images [59.537068118473066]
Segment Anything Model (SAM) has advanced interactive segmentation but is limited by the high computational cost on high-resolution images.
We focus on visual length extrapolation and propose a lightweight model named HRSAM.
The extrapolation enables HRSAM trained on low resolutions to generalize to high resolutions.
arXiv Detail & Related papers (2024-07-02T09:51:56Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
The inference of LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
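CheXprompt's prompt and rubric are not reproduced in this summary; an illustrative GPT-4-as-judge call for counting factual errors in a generated report (the function and prompt text are assumptions) might look like:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def factuality_review(reference: str, candidate: str) -> str:
    # Illustrative judge prompt only; not CheXprompt's actual rubric.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Compare the candidate radiology report against the reference. "
                "List and count the clinically significant factual errors.\n\n"
                f"Reference:\n{reference}\n\nCandidate:\n{candidate}"
            ),
        }],
    )
    return resp.choices[0].message.content
```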
- MedAide: Leveraging Large Language Models for On-Premise Medical Assistance on Edge Devices [7.042194397224198]
Large language models (LLMs) are revolutionizing various domains with their remarkable natural language processing (NLP) abilities.
However, deploying LLMs in resource-constrained edge computing and embedded systems presents significant challenges.
These challenges include delivering medical assistance in remote areas with limited healthcare facilities and infrastructure.
arXiv Detail & Related papers (2024-02-28T08:30:49Z)
- PEFT-MedAware: Large Language Model for Medical Awareness [0.0]
We propose a specialized PEFT-MedAware model to enhance the Falcon-1b large language model on the MedQuAD medical question-answering data.
The model outperformed other LLMs on medical question-answering tasks in specific domains.
We propose further improvements through expanded datasets, larger models, and feedback mechanisms for sustained medical relevance.
arXiv Detail & Related papers (2023-11-17T18:32:17Z)
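The summary does not say which PEFT method was used; a minimal LoRA setup with the Hugging Face peft library on a Falcon-1B-class checkpoint (checkpoint name and hyperparameters are illustrative assumptions) would be:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Checkpoint and LoRA hyperparameters are assumptions, not the paper's.
base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-rw-1b")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of weights train
```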
- EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z)
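The abstract describes the epitome only as "a lightweight neural operator offering convolution-like functionality". One speculative reading, sketched below purely as an assumption, stores a single small parameter tensor and crops overlapping windows from it to form the convolution kernels, so many filters share far fewer stored weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpitomeConv2d(nn.Module):
    # Kernels are overlapping crops of one small "epitome" tensor.
    # A speculative sketch of the idea, not EPIM's exact operator.
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, epi: int = 8):
        super().__init__()
        self.k = k
        self.epitome = nn.Parameter(torch.randn(in_ch, epi, epi) * 0.02)
        # One fixed crop offset per output filter (an assumption).
        self.register_buffer("offsets", torch.randint(0, epi - k + 1, (out_ch, 2)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        crops = [self.epitome[:, r:r + self.k, c:c + self.k]
                 for r, c in self.offsets.tolist()]
        weight = torch.stack(crops)          # (out_ch, in_ch, k, k)
        return F.conv2d(x, weight, padding=self.k // 2)
```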