Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
- URL: http://arxiv.org/abs/2512.22238v1
- Date: Tue, 23 Dec 2025 14:40:38 GMT
- Title: Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
- Authors: Byung-Kwan Lee, Yu-Chiang Frank Wang, Ryo Hachiuma
- Abstract summary: Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. We propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning framework.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
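The abstract's core mechanics can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the paper's implementation: the function names (`mask_non_dominant`, `keep_ratio_at`, `combined_reward`), the magnitude-based masking criterion, the linear restoration schedule, and the reward weighting `lam` are all assumptions for exposition.

```python
# Hypothetical sketch of the mask-progressive idea described in the abstract.
# All names and schedule choices here are illustrative, not from the paper.

def mask_non_dominant(weights, keep_ratio):
    """Zero out the smallest-magnitude weights, keeping a `keep_ratio` fraction.

    This stands in for masking the teacher's non-dominant weights to reduce
    unnecessary complexity before distillation.
    """
    n_keep = max(1, int(len(weights) * keep_ratio))
    threshold = sorted(abs(w) for w in weights)[-n_keep]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def keep_ratio_at(step, total_steps, start=0.3, end=1.0):
    """Progressively restore teacher capacity: keep ratio grows from `start`
    to `end` (full teacher) over the course of training."""
    t = min(step / total_steps, 1.0)
    return start + (end - start) * t

def combined_reward(accuracy_reward, distill_reward, lam=0.5):
    """Offline-RL reward combining response correctness with the ease of
    transferring the response from teacher to student."""
    return accuracy_reward + lam * distill_reward
```

Under this reading, early in training the student distills from a heavily masked (simpler) teacher, and the schedule smoothly hands it the full-capacity teacher by the end.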
Related papers
- Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation [0.0]
We propose AdvDistill, a reward-guided dataset distillation framework. We utilise multiple generations (responses) from a teacher for each prompt and assign rewards based on rule-based verifiers. These varying and normally distributed rewards serve as weights when training student models.
arXiv Detail & Related papers (2025-06-25T20:07:47Z) - From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [82.50157695987558]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy. We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z) - Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization [69.96794098855938]
Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable language models (LLMs). Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student. We introduce Alice, a framework that leverages complementary knowledge between teacher and student to enhance the learning process.
arXiv Detail & Related papers (2025-04-09T22:33:06Z) - Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition [24.293448609592147]
Multi-Teacher Knowledge Distillation (KD) transfers diverse knowledge from a teacher pool to a student network. This paper proposes Multi-Teacher Knowledge Distillation with Reinforcement Learning (MTKD-RL) to optimize multi-teacher weights.
arXiv Detail & Related papers (2025-02-22T09:31:24Z) - YODA: Teacher-Student Progressive Learning for Language Models [82.0172215948963]
This paper introduces YODA, a teacher-student progressive learning framework.
It emulates the teacher-student education process to improve the efficacy of model fine-tuning.
Experiments show that training LLaMA2 with data from YODA yields significant performance gains over standard SFT.
arXiv Detail & Related papers (2024-01-28T14:32:15Z) - The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers [14.467509261354458]
In this paper, we develop a simple framework to reduce the supervision cost of ViT distillation.
By masking input tokens, one can skip the computations associated with the masked tokens without requiring any change to teacher parameters or architecture.
We find that masking patches with the lowest student attention scores is highly effective, saving up to 50% of teacher FLOPs without any drop in student accuracy.
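The token-masking idea in this summary can be illustrated with a short sketch. This is a hypothetical illustration, not the paper's code: the function name `select_tokens_by_attention` and the ranking-by-score selection are assumptions standing in for "mask the patches with the lowest student attention scores."

```python
# Illustrative sketch (names hypothetical): keep only the input tokens the
# student attends to most, so the teacher's forward pass skips the rest.

def select_tokens_by_attention(tokens, student_attention, keep_fraction=0.5):
    """Drop the tokens with the lowest student attention scores.

    Running the teacher only on the kept tokens cuts its FLOPs without
    touching teacher parameters or architecture.
    """
    n_keep = max(1, int(len(tokens) * keep_fraction))
    # Rank token indices by student attention, highest first.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: student_attention[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve original token order
    return [tokens[i] for i in kept]
```

With `keep_fraction=0.5`, the teacher processes half the tokens, matching the up-to-50% FLOP saving the summary reports.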
arXiv Detail & Related papers (2023-02-21T07:48:34Z) - PrUE: Distilling Knowledge from Sparse Teacher Networks [4.087221125836262]
We present a pruning method termed Prediction Uncertainty Enlargement (PrUE) to simplify the teacher.
We empirically investigate the effectiveness of the proposed method with experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet.
Our method allows researchers to distill knowledge from deeper networks to improve students further.
arXiv Detail & Related papers (2022-07-03T08:14:24Z) - Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z) - One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers [54.146208195806636]
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs.
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
arXiv Detail & Related papers (2021-06-02T08:42:33Z) - Densely Guided Knowledge Distillation using Multiple Teacher Assistants [5.169724825219126]
We propose a densely guided knowledge distillation using multiple teacher assistants that gradually decreases the model size.
We also design a stochastic teaching strategy in which, for each mini-batch, the teacher or some teacher assistants are randomly dropped.
This acts as a regularizer to improve the efficiency of teaching of the student network.
arXiv Detail & Related papers (2020-09-18T13:12:52Z)