Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources
- URL: http://arxiv.org/abs/2512.02438v1
- Date: Tue, 02 Dec 2025 05:53:51 GMT
- Title: Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources
- Authors: Phuc Pham, Nhu Pham, Ngoc Quoc Ly
- Abstract summary: In healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). We focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in few-shot adaptation.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in few-shot adaptation, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at https://github.com/phphuc612/MSD.
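The two mechanisms named in the abstract (a momentum teacher distilling into the student, and gradient accumulation to enlarge the effective batch on one GPU) follow a well-known MoCo/DINO-style recipe. The PyTorch sketch below illustrates that recipe under stated assumptions; the class names, encoder interface, and temperatures are hypothetical, and this is not the released code at the repository linked above.

```python
# Hedged sketch of momentum self-distillation with gradient accumulation.
# The encoder interface (returns an image/text embedding pair), the
# temperatures, and all names are illustrative assumptions.
import copy
import torch
import torch.nn.functional as F

class MomentumSelfDistiller(torch.nn.Module):
    def __init__(self, encoder, momentum=0.999, tau_t=0.04, tau_s=0.1):
        super().__init__()
        self.student = encoder
        # Momentum (EMA) teacher: same architecture, never backpropagated.
        self.teacher = copy.deepcopy(encoder)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.m, self.tau_t, self.tau_s = momentum, tau_t, tau_s

    @torch.no_grad()
    def update_teacher(self):
        # EMA update: teacher <- m * teacher + (1 - m) * student.
        for pt, ps in zip(self.teacher.parameters(), self.student.parameters()):
            pt.mul_(self.m).add_(ps, alpha=1.0 - self.m)

    def distill_loss(self, images, texts):
        # Both branches produce L2-normalized image/text embeddings.
        s_img, s_txt = (F.normalize(z, dim=-1) for z in self.student(images, texts))
        with torch.no_grad():
            t_img, t_txt = (F.normalize(z, dim=-1) for z in self.teacher(images, texts))
        # Soft image->text targets from the momentum teacher replace
        # hard one-hot contrastive targets (self-distillation).
        s_logits = s_img @ s_txt.t() / self.tau_s
        t_probs = (t_img @ t_txt.t() / self.tau_t).softmax(dim=-1)
        return -(t_probs * s_logits.log_softmax(dim=-1)).sum(dim=-1).mean()

def train_step(model, optimizer, micro_batches):
    # Gradient accumulation: K micro-batches emulate one large batch
    # before a single optimizer step and a single teacher update.
    optimizer.zero_grad()
    K = len(micro_batches)
    for images, texts in micro_batches:
        (model.distill_loss(images, texts) / K).backward()
    optimizer.step()
    model.update_teacher()
```

Dividing each micro-batch loss by `K` keeps the accumulated gradient equal to that of one large batch, so the enlarged effective batch costs no extra memory, and the EMA update runs exactly once per optimizer step.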
Related papers
- Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding [53.18433310890516]
Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings. We propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning.
arXiv Detail & Related papers (2025-11-11T17:23:02Z)
- The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback [51.144727949988436]
Reinforcement learning (RL) has demonstrated potential to enhance the reasoning capabilities of large language models (LLMs). In this work, we explore improving LLMs through RL with minimal data. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness.
arXiv Detail & Related papers (2025-10-03T06:32:10Z)
- Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study [64.26593350748401]
Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs). We propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training.
arXiv Detail & Related papers (2025-07-28T11:57:52Z)
- Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking [26.80161478380058]
Large language models (LLMs) have grown at an unprecedented rate, with some recent models containing trillions of parameters. This growth is accompanied by substantial computational challenges, particularly regarding the memory and compute resources required for training and fine-tuning. Motivated by this issue, we aim to address the following question: can parameter- or memory-efficient methods enhance pre-training efficiency while achieving performance comparable to full-model training?
arXiv Detail & Related papers (2025-05-28T22:51:43Z)
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
- CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination [28.061239778773423]
Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over a wide range of tasks. However, CLIP heavily relies on a substantial corpus of pre-training data, resulting in notable consumption of computational resources. We introduce CLIP-CID, a novel distillation mechanism that effectively transfers knowledge from a large vision-language foundation model to a smaller model.
arXiv Detail & Related papers (2024-08-18T11:23:21Z)
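CLIP-CID's cluster-instance discrimination is not reproduced here, but the teacher-to-student CLIP distillation it builds on can be sketched as matching in-batch similarity distributions. A minimal illustration, assuming both models return an image/text embedding pair; the interfaces and temperature are assumptions, not the paper's code:

```python
# Generic CLIP-style distillation sketch (not CLIP-CID's full
# cluster-instance mechanism; all interfaces below are assumed).
import torch
import torch.nn.functional as F

def clip_distill_loss(student, teacher, images, texts, tau=0.07):
    # Both models map a batch to image and text embeddings.
    s_img, s_txt = student(images, texts)
    with torch.no_grad():
        t_img, t_txt = teacher(images, texts)
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)
    # In-batch image->text similarity logits for both models.
    s_logits = s_img @ s_txt.t() / tau
    t_logits = t_img @ t_txt.t() / tau
    # The student matches the teacher's soft similarity distribution.
    return F.kl_div(s_logits.log_softmax(dim=-1),
                    t_logits.softmax(dim=-1), reduction="batchmean")
```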
- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- Multimodal Deep Learning for Low-Resource Settings: A Vector Embedding Alignment Approach for Healthcare Applications [3.2549142515720044]
We advocate for leveraging vector embeddings to enable flexible and efficient computational methodologies.
Our paper investigates the efficiency of using vector embeddings from single-modal foundation models and multi-modal Vision-Language Models.
We propose a simple yet effective inference-time method to enhance performance by aligning image-text embeddings.
arXiv Detail & Related papers (2024-06-02T01:13:01Z)
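The summary above does not specify the alignment method, so the sketch below shows one standard inference-time choice: a closed-form orthogonal Procrustes map fit on a small paired calibration set. The Procrustes choice and all names are assumptions for illustration only:

```python
# Hedged sketch of inference-time image-text embedding alignment.
import torch
import torch.nn.functional as F

def fit_procrustes(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    # Orthogonal W minimizing ||img_emb @ W - txt_emb||_F, obtained
    # from the SVD of the cross-covariance (classic Procrustes solution).
    u, _, vh = torch.linalg.svd(img_emb.t() @ txt_emb)
    return u @ vh

@torch.no_grad()
def align_and_score(img_emb, txt_emb, W):
    # Map image embeddings into the text space, then score by cosine.
    aligned = F.normalize(img_emb @ W, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return aligned @ txt.t()  # similarity matrix for retrieval

# Usage: fit W once on a small paired calibration set, reuse at inference.
# W = fit_procrustes(calib_img, calib_txt)
# sims = align_and_score(test_img, test_txt, W)
```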
- Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation [11.778576032848482]
This work enhances scientific language models by improving their inference and evaluation capabilities with minimal or no additional training. We reveal intriguing insights into the behaviour and suitability of such methods while significantly surpassing state-of-the-art models. We propose a novel atomic-level evaluation method leveraging off-the-shelf Natural Language Inference (NLI) models for use in the unseen chemical domain.
arXiv Detail & Related papers (2024-05-22T20:40:53Z)
- Knowledge Distillation for Road Detection based on cross-model Semi-Supervised Learning [17.690698736544626]
We propose an integrated approach that combines knowledge distillation and semi-supervised learning methods.
This hybrid approach leverages the robust capabilities of large models to effectively utilise large unlabelled data.
The proposed semi-supervised learning-based knowledge distillation (SSLKD) approach demonstrates a notable improvement in the performance of the student model.
arXiv Detail & Related papers (2024-02-07T22:50:47Z)
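As a rough illustration of combining distillation with semi-supervised learning, the sketch below trains a student on labelled data while matching a frozen teacher's softened predictions on unlabelled images. The loss weighting, temperature, and interfaces are assumptions, not the paper's exact SSLKD formulation:

```python
# Hedged sketch of semi-supervised knowledge distillation: a large
# teacher supplies soft targets on unlabelled images while the student
# also fits the labelled set. All constants are illustrative.
import torch
import torch.nn.functional as F

def sslkd_loss(student, teacher, labelled, unlabelled, alpha=0.5, tau=2.0):
    x_l, y_l = labelled   # labelled images and per-pixel class targets
    x_u = unlabelled      # unlabelled images
    # Supervised term on labelled data.
    sup = F.cross_entropy(student(x_l), y_l)
    # Distillation term: match the teacher's temperature-softened
    # predictions over the class dimension on unlabelled data.
    with torch.no_grad():
        t_prob = (teacher(x_u) / tau).softmax(dim=1)
    s_logp = (student(x_u) / tau).log_softmax(dim=1)
    distill = F.kl_div(s_logp, t_prob, reduction="batchmean") * tau * tau
    return sup + alpha * distill
```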
- Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting [66.45372974713189]
We propose a recall and learn mechanism, which adopts the idea of multi-task learning and jointly learns pretraining tasks and downstream tasks.
Experiments show that our method achieves state-of-the-art performance on the GLUE benchmark.
We provide the open-source RecAdam optimizer, which integrates the proposed mechanisms into Adam to facilitate the NLP community.
arXiv Detail & Related papers (2020-04-27T08:59:57Z)
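The recall-and-learn idea can be approximated at the loss level: an annealed quadratic penalty pulls fine-tuned weights back toward their pretrained values, with emphasis shifting to the downstream task as training proceeds. The sketch below is a simplified stand-in for the released RecAdam optimizer, with illustrative schedule constants:

```python
# Hedged sketch of recall-and-learn as a loss wrapper (not the actual
# RecAdam optimizer internals; k, t0, and gamma are assumed constants).
import math

def recall_and_learn_loss(task_loss, model, pretrained_params,
                          step, k=0.05, t0=1000, gamma=0.5):
    # Quadratic "recall" penalty toward the frozen pretrained weights.
    recall = sum(((p - p0) ** 2).sum()
                 for p, p0 in zip(model.parameters(), pretrained_params))
    # Sigmoid annealing: early steps emphasize recalling pretraining
    # knowledge; later steps emphasize the downstream task.
    lam = 1.0 / (1.0 + math.exp(-k * (step - t0)))
    return lam * task_loss + (1.0 - lam) * gamma * recall

# Snapshot the pretrained weights once, before fine-tuning begins:
# pretrained_params = [p.detach().clone() for p in model.parameters()]
```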