Online In-Context Distillation for Low-Resource Vision Language Models
- URL: http://arxiv.org/abs/2510.18117v1
- Date: Mon, 20 Oct 2025 21:35:17 GMT
- Title: Online In-Context Distillation for Low-Resource Vision Language Models
- Authors: Zhiqi Kang, Rahaf Aljundi, Vaggelis Dorovatas, Karteek Alahari
- Abstract summary: Small vision-language models (VLMs) are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. We propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time. Our method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations.
- Score: 16.3054668860198
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
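A minimal sketch of how such an online ICD loop could be organized is shown below. The `student`, `teacher`, and `embed` callables, the confidence threshold, and the demonstration-pool layout are illustrative assumptions, not the authors' implementation.

```python
# Illustrative online ICD loop: the student answers with retrieved demonstrations and
# only queries the teacher when it is uncertain and annotation budget remains.
# `student`, `teacher`, and `embed` are hypothetical callables, not the paper's API.
from dataclasses import dataclass, field

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / max(den, 1e-8)

@dataclass
class DemoPool:
    demos: list = field(default_factory=list)  # dicts with embedding, image, question, answer

    def retrieve(self, query_emb, k=4):
        # Rank stored demonstrations by similarity to the query embedding.
        ranked = sorted(self.demos, key=lambda d: -cosine(query_emb, d["emb"]))
        return ranked[:k]

def online_icd(stream, student, teacher, embed, budget, tau=0.5, k=4):
    pool, queries = DemoPool(), 0
    for image, question in stream:
        q_emb = embed(image, question)
        demos = pool.retrieve(q_emb, k)
        answer, confidence = student(image, question, demos)  # in-context prediction
        # Uncertainty conditioning: spend a teacher query only when the student is unsure.
        if confidence < tau and queries < budget:
            answer = teacher(image, question)  # could be sampled repeatedly and voted to reduce noise
            pool.demos.append({"emb": q_emb, "image": image,
                               "question": question, "answer": answer})
            queries += 1
        yield answer
```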
Related papers
- On-Policy Context Distillation for Language Models [92.82835176360864]
We propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation and system prompt distillation.
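One plausible reading of an on-policy context-distillation step is sketched below, assuming Hugging Face-style `generate`/`logits` interfaces and a shared tokenizer; this is an illustration, not the paper's code.

```python
# Sketch of an on-policy context-distillation step (one reading of the idea).
import torch
import torch.nn.functional as F

def opcd_step(student, teacher, prompt_ids, context_ids, optimizer, max_new_tokens=64):
    # 1) On-policy: sample a continuation from the current student, without the extra context.
    with torch.no_grad():
        sampled = student.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)

    # 2) The teacher scores the same tokens *given* the context to be distilled
    #    (e.g. a system prompt or experiential knowledge).
    teacher_input = torch.cat([context_ids, sampled], dim=-1)
    with torch.no_grad():
        teacher_logits = teacher(teacher_input).logits[:, -sampled.size(-1):]

    # 3) The student scores its own sample without the context.
    student_logits = student(sampled).logits

    # 4) Distillation KL on the student's own tokens, pulling it toward the
    #    teacher-with-context distribution.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.log_softmax(teacher_logits, dim=-1),
                    log_target=True, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```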
arXiv Detail & Related papers (2026-02-12T18:58:28Z)
- VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation [67.98620973023709]
VOLD is a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. We show that VOLD significantly outperforms the baseline model and improves over the state of the art by a clear margin.
arXiv Detail & Related papers (2025-10-27T16:32:12Z)
- Unified Reinforcement and Imitation Learning for Vision-Language Models [84.84277196012907]
Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs.
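A rough sketch of how an imitation term and a reinforcement term might be combined in a single objective follows; the weighting, the REINFORCE-style reward term, and all tensor shapes are assumptions made for illustration only.

```python
# Placeholder combined objective: an imitation (distillation) term plus a reward-weighted
# likelihood term on the student's own samples.
import torch
import torch.nn.functional as F

def ril_loss(student_logits, teacher_logits, sampled_ids, rewards, beta=0.5):
    """student_logits, teacher_logits: (B, T, V); sampled_ids: (B, T); rewards: (B,)."""
    # Imitation: match the teacher's token distribution.
    imitation = F.kl_div(F.log_softmax(student_logits, dim=-1),
                         F.log_softmax(teacher_logits, dim=-1),
                         log_target=True, reduction="batchmean")
    # Reinforcement: reward-weighted log-likelihood of the sampled tokens.
    logp = F.log_softmax(student_logits, dim=-1)
    token_logp = logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    reinforce = -(rewards.unsqueeze(-1) * token_logp).mean()
    return beta * imitation + (1.0 - beta) * reinforce
```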
arXiv Detail & Related papers (2025-10-22T07:12:14Z)
- Multi-MLLM Knowledge Distillation for Out-of-Context News Detection [17.41734069411864]
Multimodal out-of-context news is a type of misinformation in which the image is used outside of its original context. We introduce a two-stage knowledge distillation framework to transfer this knowledge to a student MLLM. In Stage 1, we apply LoRA fine-tuning to the student model using all training data. In Stage 2, we further fine-tune the student model using both LoRA fine-tuning and DPO on the data points where teachers' predictions conflict.
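A hedged sketch of how the Stage 2 data could be assembled: keep only the examples on which the teacher MLLMs disagree and turn them into DPO preference pairs. The rule used here to pick the "chosen" side (agreement with the gold label) is an assumption, not necessarily the paper's criterion.

```python
# Hypothetical Stage 2 data construction for DPO on conflicting teacher predictions.
def build_dpo_pairs(examples, teacher_preds):
    """examples: dicts with 'prompt' and 'label'; teacher_preds: one answer list per example."""
    pairs = []
    for ex, answers in zip(examples, teacher_preds):
        if len(set(answers)) <= 1:
            continue  # teachers agree: already handled by Stage 1 LoRA fine-tuning
        chosen = [a for a in answers if a == ex["label"]]
        rejected = [a for a in answers if a != ex["label"]]
        if chosen and rejected:
            pairs.append({"prompt": ex["prompt"], "chosen": chosen[0], "rejected": rejected[0]})
    return pairs
```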
arXiv Detail & Related papers (2025-05-28T16:03:41Z)
- TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks [15.308801774590597]
The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. In this work, we investigate this alignment bottleneck through the lens of mutual information. We propose TinyAlign, a novel framework inspired by Retrieval-Augmented Generation, which strategically retrieves relevant context from a memory bank to enrich multimodal inputs and enhance their alignment.
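A minimal sketch of retrieval-augmented input enrichment in this spirit follows; the memory-bank layout, cosine scoring, and the way retrieved context is prepended are illustrative choices rather than TinyAlign's actual design.

```python
# Illustrative retrieval-augmented enrichment: look up the nearest memory-bank entries
# by embedding similarity and prepend them to the question.
import numpy as np

class MemoryBank:
    def __init__(self, keys, values):
        # keys: (N, d) embeddings; values: N context snippets (e.g. captions, QA pairs)
        self.keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        self.values = values

    def retrieve(self, query, k=3):
        q = query / np.linalg.norm(query)
        scores = self.keys @ q
        top = np.argsort(-scores)[:k]
        return [self.values[i] for i in top]

def enrich_input(image_emb, question, bank, k=3):
    # Prepend retrieved context so the frozen language model sees extra, relevant tokens.
    context = bank.retrieve(image_emb, k)
    return "\n".join(context) + "\n" + question
```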
arXiv Detail & Related papers (2025-05-19T09:11:54Z)
- Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels.
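A small sketch of what "dense sampling at high-noise time steps" could look like in training code; the power-law weighting below is purely illustrative and may differ from the actual ESS schedule.

```python
# Purely illustrative timestep sampler: draw training time steps with a bias toward
# high-noise (late) steps instead of uniformly.
import numpy as np

def sample_timesteps(batch_size, num_steps=1000, bias=2.0, rng=None):
    rng = rng or np.random.default_rng()
    t = np.arange(num_steps)
    weights = (t + 1.0) ** bias   # weight grows with t, so noisier steps are sampled more often
    weights /= weights.sum()
    return rng.choice(num_steps, size=batch_size, p=weights)
```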
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
- TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models [6.8298782282181865]
We introduce Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation approach. We show TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
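A sketch of a temporally interpolated distillation target in this spirit: the target mixes the student's and the teacher's distributions and shifts toward the teacher over training. The linear ramp below is illustrative; TAID's schedule is adaptive.

```python
# Sketch of a temporally interpolated distillation loss with an "intermediate teacher".
import torch
import torch.nn.functional as F

def interpolated_distill_loss(student_logits, teacher_logits, step, total_steps):
    alpha = min(1.0, step / total_steps)                     # 0 -> pure student, 1 -> pure teacher
    p_student = F.softmax(student_logits, dim=-1).detach()
    p_teacher = F.softmax(teacher_logits, dim=-1)
    target = (1.0 - alpha) * p_student + alpha * p_teacher   # intermediate teacher distribution
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target, reduction="batchmean")
```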
arXiv Detail & Related papers (2025-01-28T13:31:18Z)
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Instead, our studies examine training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With the proper strategy, evaluated across different benchmarks, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- Towards Multimodal In-Context Learning for Vision & Language Models [21.69457980865084]
State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modalities.
We propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes.
arXiv Detail & Related papers (2024-03-19T13:53:37Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy, which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, model performance can be boosted in an unsupervised manner.
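A minimal sketch of this denoise-then-self-train loop, with placeholder callables for the vision model, the MLLM, and the fine-tuning step; it illustrates the loop rather than the paper's implementation.

```python
# Placeholder loop: an MLLM revises the vision model's noisy prediction in-context,
# and the revised labels become the fine-tuning targets (no ground truth needed).
def denoise_labels(images, vision_model, mllm, demos):
    denoised = []
    for image in images:
        noisy = vision_model(image)            # possibly wrong label
        corrected = mllm(image, noisy, demos)  # MLLM revises it with in-context examples
        denoised.append(corrected)
    return denoised

def therapy_round(images, vision_model, mllm, demos, fine_tune):
    labels = denoise_labels(images, vision_model, mllm, demos)
    # Unsupervised boost: train on MLLM-denoised labels instead of ground truth.
    return fine_tune(vision_model, list(zip(images, labels)))
```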
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
- Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
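A hedged sketch of a contrastive-distillation loss for image-text retrieval, where the student's similarity distribution is pulled toward the teacher's soft targets in both retrieval directions; the dynamic example weighting that gives DCD its name is omitted here.

```python
# Sketch of contrastive distillation between similarity matrices of a student and a teacher.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_img, student_txt, teacher_img, teacher_txt, tau=0.05):
    """All inputs are L2-normalized embedding batches of shape (B, d)."""
    s_sim = student_img @ student_txt.t() / tau   # (B, B) student similarities
    t_sim = teacher_img @ teacher_txt.t() / tau   # (B, B) teacher similarities
    # Match the teacher's soft targets for image->text and text->image retrieval.
    loss_i2t = F.kl_div(F.log_softmax(s_sim, dim=1), F.softmax(t_sim, dim=1),
                        reduction="batchmean")
    loss_t2i = F.kl_div(F.log_softmax(s_sim.t(), dim=1), F.softmax(t_sim.t(), dim=1),
                        reduction="batchmean")
    return 0.5 * (loss_i2t + loss_t2i)
```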
arXiv Detail & Related papers (2022-07-04T14:08:59Z)