Online In-Context Distillation for Low-Resource Vision Language Models
- URL: http://arxiv.org/abs/2510.18117v1
- Date: Mon, 20 Oct 2025 21:35:17 GMT
- Title: Online In-Context Distillation for Low-Resource Vision Language Models
- Authors: Zhiqi Kang, Rahaf Aljundi, Vaggelis Dorovatas, Karteek Alahari
- Abstract summary: Small vision-language models (VLMs) are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. We propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time. Our method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations.
- Score: 16.3054668860198
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
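A minimal sketch of how such an online ICD loop could be organized is shown below. The `student`, `teacher`, and `embed` callables, the confidence threshold, and the demonstration-pool layout are illustrative assumptions, not the authors' implementation.

```python
# Illustrative online ICD loop: the student answers with retrieved demonstrations and
# only queries the teacher when it is uncertain and annotation budget remains.
# `student`, `teacher`, and `embed` are hypothetical callables, not the paper's API.
from dataclasses import dataclass, field

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / max(den, 1e-8)

@dataclass
class DemoPool:
    demos: list = field(default_factory=list)  # dicts with embedding, image, question, answer

    def retrieve(self, query_emb, k=4):
        # Rank stored demonstrations by similarity to the query embedding.
        ranked = sorted(self.demos, key=lambda d: -cosine(query_emb, d["emb"]))
        return ranked[:k]

def online_icd(stream, student, teacher, embed, budget, tau=0.5, k=4):
    pool, queries = DemoPool(), 0
    for image, question in stream:
        q_emb = embed(image, question)
        demos = pool.retrieve(q_emb, k)
        answer, confidence = student(image, question, demos)  # in-context prediction
        # Uncertainty conditioning: spend a teacher query only when the student is unsure.
        if confidence < tau and queries < budget:
            answer = teacher(image, question)  # could be sampled repeatedly and voted to reduce noise
            pool.demos.append({"emb": q_emb, "image": image,
                               "question": question, "answer": answer})
            queries += 1
        yield answer
```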
Related papers
- On-Policy Context Distillation for Language Models [92.82835176360864]
We propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation and system prompt distillation.
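One plausible reading of an on-policy context-distillation step is sketched below, assuming Hugging Face-style `generate`/`logits` interfaces and a shared tokenizer; this is an illustration, not the paper's code.

```python
# Sketch of an on-policy context-distillation step (one reading of the idea).
import torch
import torch.nn.functional as F

def opcd_step(student, teacher, prompt_ids, context_ids, optimizer, max_new_tokens=64):
    # 1) On-policy: sample a continuation from the current student, without the extra context.
    with torch.no_grad():
        sampled = student.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)

    # 2) The teacher scores the same tokens *given* the context to be distilled
    #    (e.g. a system prompt or experiential knowledge).
    teacher_input = torch.cat([context_ids, sampled], dim=-1)
    with torch.no_grad():
        teacher_logits = teacher(teacher_input).logits[:, -sampled.size(-1):]

    # 3) The student scores its own sample without the context.
    student_logits = student(sampled).logits

    # 4) Distillation KL on the student's own tokens, pulling it toward the
    #    teacher-with-context distribution.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.log_softmax(teacher_logits, dim=-1),
                    log_target=True, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```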
arXiv Detail & Related papers (2026-02-12T18:58:28Z)
- VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation [67.98620973023709]
VOLD is a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. We show that VOLD significantly outperforms the baseline model and improves over the state of the art by a clear margin.
arXiv Detail & Related papers (2025-10-27T16:32:12Z)
- Unified Reinforcement and Imitation Learning for Vision-Language Models [84.84277196012907]
Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs.
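A rough sketch of how an imitation term and a reinforcement term might be combined in a single objective follows; the weighting, the REINFORCE-style reward term, and all tensor shapes are assumptions made for illustration only.

```python
# Placeholder combined objective: an imitation (distillation) term plus a reward-weighted
# likelihood term on the student's own samples.
import torch
import torch.nn.functional as F

def ril_loss(student_logits, teacher_logits, sampled_ids, rewards, beta=0.5):
    """student_logits, teacher_logits: (B, T, V); sampled_ids: (B, T); rewards: (B,)."""
    # Imitation: match the teacher's token distribution.
    imitation = F.kl_div(F.log_softmax(student_logits, dim=-1),
                         F.log_softmax(teacher_logits, dim=-1),
                         log_target=True, reduction="batchmean")
    # Reinforcement: reward-weighted log-likelihood of the sampled tokens.
    logp = F.log_softmax(student_logits, dim=-1)
    token_logp = logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    reinforce = -(rewards.unsqueeze(-1) * token_logp).mean()
    return beta * imitation + (1.0 - beta) * reinforce
```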
arXiv Detail & Related papers (2025-10-22T07:12:14Z)
- Multi-MLLM Knowledge Distillation for Out-of-Context News Detection [17.41734069411864]
Multimodal out-of-context news is a type of misinformation in which the image is used outside of its original context. We introduce a two-stage knowledge distillation framework to transfer this knowledge to a student MLLM. In Stage 1, we apply LoRA fine-tuning to the student model using all training data. In Stage 2, we further fine-tune the student model using both LoRA fine-tuning and DPO on the data points where teachers' predictions conflict.
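A hedged sketch of how the Stage 2 data could be assembled: keep only the examples on which the teacher MLLMs disagree and turn them into DPO preference pairs. The rule used here to pick the "chosen" side (agreement with the gold label) is an assumption, not necessarily the paper's criterion.

```python
# Hypothetical Stage 2 data construction for DPO on conflicting teacher predictions.
def build_dpo_pairs(examples, teacher_preds):
    """examples: dicts with 'prompt' and 'label'; teacher_preds: one answer list per example."""
    pairs = []
    for ex, answers in zip(examples, teacher_preds):
        if len(set(answers)) <= 1:
            continue  # teachers agree: already handled by Stage 1 LoRA fine-tuning
        chosen = [a for a in answers if a == ex["label"]]
        rejected = [a for a in answers if a != ex["label"]]
        if chosen and rejected:
            pairs.append({"prompt": ex["prompt"], "chosen": chosen[0], "rejected": rejected[0]})
    return pairs
```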
arXiv Detail & Related papers (2025-05-28T16:03:41Z)
- TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks [15.308801774590597]
The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. In this work, we investigate this alignment bottleneck through the lens of mutual information. We propose TinyAlign, a novel framework inspired by Retrieval-Augmented Generation, which strategically retrieves relevant context from a memory bank to enrich multimodal inputs and enhance their alignment.
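A minimal sketch of retrieval-augmented input enrichment in this spirit follows; the memory-bank layout, cosine scoring, and the way retrieved context is prepended are illustrative choices rather than TinyAlign's actual design.

```python
# Illustrative retrieval-augmented enrichment: look up the nearest memory-bank entries
# by embedding similarity and prepend them to the question.
import numpy as np

class MemoryBank:
    def __init__(self, keys, values):
        # keys: (N, d) embeddings; values: N context snippets (e.g. captions, QA pairs)
        self.keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        self.values = values

    def retrieve(self, query, k=3):
        q = query / np.linalg.norm(query)
        scores = self.keys @ q
        top = np.argsort(-scores)[:k]
        return [self.values[i] for i in top]

def enrich_input(image_emb, question, bank, k=3):
    # Prepend retrieved context so the frozen language model sees extra, relevant tokens.
    context = bank.retrieve(image_emb, k)
    return "\n".join(context) + "\n" + question
```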
arXiv Detail & Related papers (2025-05-19T09:11:54Z)
- Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels.
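A small sketch of what "dense sampling at high-noise time steps" could look like in training code; the power-law weighting below is purely illustrative and may differ from the actual ESS schedule.

```python
# Purely illustrative timestep sampler: draw training time steps with a bias toward
# high-noise (late) steps instead of uniformly.
import numpy as np

def sample_timesteps(batch_size, num_steps=1000, bias=2.0, rng=None):
    rng = rng or np.random.default_rng()
    t = np.arange(num_steps)
    weights = (t + 1.0) ** bias   # weight grows with t, so noisier steps are sampled more often
    weights /= weights.sum()
    return rng.choice(num_steps, size=batch_size, p=weights)
```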
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
- TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models [6.8298782282181865]
We introduce Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation approach. We show TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
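A sketch of a temporally interpolated distillation target in this spirit: the target mixes the student's and the teacher's distributions and shifts toward the teacher over training. The linear ramp below is illustrative; TAID's schedule is adaptive.

```python
# Sketch of a temporally interpolated distillation loss with an "intermediate teacher".
import torch
import torch.nn.functional as F

def interpolated_distill_loss(student_logits, teacher_logits, step, total_steps):
    alpha = min(1.0, step / total_steps)                     # 0 -> pure student, 1 -> pure teacher
    p_student = F.softmax(student_logits, dim=-1).detach()
    p_teacher = F.softmax(teacher_logits, dim=-1)
    target = (1.0 - alpha) * p_student + alpha * p_teacher   # intermediate teacher distribution
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target, reduction="batchmean")
```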
arXiv Detail & Related papers (2025-01-28T13:31:18Z)
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Instead, our studies examine training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With the proper strategy, evaluated across different benchmarks, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- Towards Multimodal In-Context Learning for Vision & Language Models [21.69457980865084]
State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modalities.
We propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes.
arXiv Detail & Related papers (2024-03-19T13:53:37Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy, which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, model performance can be boosted in an unsupervised manner.
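A minimal sketch of this denoise-then-self-train loop, with placeholder callables for the vision model, the MLLM, and the fine-tuning step; it illustrates the loop rather than the paper's implementation.

```python
# Placeholder loop: an MLLM revises the vision model's noisy prediction in-context,
# and the revised labels become the fine-tuning targets (no ground truth needed).
def denoise_labels(images, vision_model, mllm, demos):
    denoised = []
    for image in images:
        noisy = vision_model(image)            # possibly wrong label
        corrected = mllm(image, noisy, demos)  # MLLM revises it with in-context examples
        denoised.append(corrected)
    return denoised

def therapy_round(images, vision_model, mllm, demos, fine_tune):
    labels = denoise_labels(images, vision_model, mllm, demos)
    # Unsupervised boost: train on MLLM-denoised labels instead of ground truth.
    return fine_tune(vision_model, list(zip(images, labels)))
```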
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
- Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
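A hedged sketch of a contrastive-distillation loss for image-text retrieval, where the student's similarity distribution is pulled toward the teacher's soft targets in both retrieval directions; the dynamic example weighting that gives DCD its name is omitted here.

```python
# Sketch of contrastive distillation between similarity matrices of a student and a teacher.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_img, student_txt, teacher_img, teacher_txt, tau=0.05):
    """All inputs are L2-normalized embedding batches of shape (B, d)."""
    s_sim = student_img @ student_txt.t() / tau   # (B, B) student similarities
    t_sim = teacher_img @ teacher_txt.t() / tau   # (B, B) teacher similarities
    # Match the teacher's soft targets for image->text and text->image retrieval.
    loss_i2t = F.kl_div(F.log_softmax(s_sim, dim=1), F.softmax(t_sim, dim=1),
                        reduction="batchmean")
    loss_t2i = F.kl_div(F.log_softmax(s_sim.t(), dim=1), F.softmax(t_sim.t(), dim=1),
                        reduction="batchmean")
    return 0.5 * (loss_i2t + loss_t2i)
```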
arXiv Detail & Related papers (2022-07-04T14:08:59Z)