PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models
- URL: http://arxiv.org/abs/2506.00910v2
- Date: Wed, 01 Oct 2025 01:14:57 GMT
- Title: PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models
- Authors: Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Dongseop Kim, Sung Ju Hwang
- Abstract summary: We introduce ActiveKD, a framework that integrates active learning with knowledge distillation. A key aspect of ActiveKD is the structured prediction bias of large vision-language models (VLMs). We propose Probabilistic CoreSet (PCoreSet), a selection strategy that maximizes coverage in the probability space rather than the feature space.
- Score: 44.421994768941126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is a widely used framework for training compact, task-specific models by transferring knowledge from teacher models. However, its application to active learning (AL), which aims to minimize annotation costs through iterative sample selection, remains underexplored. This gap stems from the fact that KD typically assumes access to sufficient labeled data, whereas AL operates in data-scarce scenarios where task-specific teacher models are often unavailable. In this paper, we first introduce ActiveKD, a framework that integrates AL with KD by leveraging the zero- and few-shot capabilities of large vision-language models (VLMs). A key aspect of ActiveKD is the structured prediction bias of VLMs, i.e., their predictions form clusters in the probability space. We regard this structure as an inductive bias of the teacher model, capturing generalizable output patterns beneficial to student learning. To exploit this bias, we propose Probabilistic CoreSet (PCoreSet), a selection strategy that maximizes coverage in the probability space rather than the feature space. PCoreSet strategically selects probabilistically diverse unlabeled samples, facilitating more efficient transfer of teacher knowledge under limited annotation budgets. Extensive evaluations on 11 datasets show that ActiveKD consistently improves performance across selection methods (e.g., +29.07% on ImageNet, averaged over methods). Under ActiveKD, PCoreSet ranks first in 64/73 settings (approximately 87.7%) across 5 student and 3 teacher networks, always achieving the best performance except for the first two AL rounds. Our code is available at https://github.com/erjui/PCoreSet.
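The abstract does not include pseudocode, but the selection rule it describes (maximizing coverage of the teacher's probability space) can be illustrated with a farthest-first, k-center-style greedy pass over teacher softmax vectors. The sketch below is an interpretation of that idea, not the authors' released implementation; the function name, the Euclidean distance, and the NumPy-based data layout are assumptions made for illustration.

```python
import numpy as np

def pcoreset_select(probs_unlabeled, probs_labeled, budget):
    """Greedy k-center selection in probability space (a minimal sketch of
    the PCoreSet idea: pick unlabeled samples whose teacher-predicted class
    distributions are farthest from those already covered).

    probs_unlabeled: (N, C) teacher softmax outputs on the unlabeled pool
    probs_labeled:   (M, C) teacher softmax outputs on the labeled set (M >= 1 assumed)
    budget:          number of samples to select in this AL round
    """
    selected = []
    covered = probs_labeled.copy()
    for _ in range(budget):
        # Distance of each unlabeled point to its nearest already-covered point.
        dists = np.linalg.norm(
            probs_unlabeled[:, None, :] - covered[None, :, :], axis=-1
        ).min(axis=1)
        idx = int(dists.argmax())  # farthest-first (k-center) rule
        selected.append(idx)
        covered = np.vstack([covered, probs_unlabeled[idx]])
    return selected
```

Once a point is added to the covered set its distance drops to zero, so the greedy loop naturally moves on to the next most "probabilistically novel" sample; the selected indices would then be sent for annotation and the student retrained with the VLM teacher's soft labels.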
Related papers
- Rethinking Selective Knowledge Distillation [21.167064592056196]
It remains unclear which importance signals, selection policies, and their interplay are most effective. We introduce student-entropy-guided position selection (SE-KD) across the class and sample axes. This approach yields complementary efficiency gains that make offline teacher caching feasible.
arXiv Detail & Related papers (2026-02-01T18:58:27Z) - A Dual-Space Framework for General Knowledge Distillation of Large Language Models [98.73585104789217]
Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. The current white-box KD framework exhibits two limitations. We propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD.
arXiv Detail & Related papers (2025-04-15T17:38:47Z) - Class Balance Matters to Active Class-Incremental Learning [61.11786214164405]
We aim to start from a pool of large-scale unlabeled data and then annotate the most informative samples for incremental learning. We propose a Class-Balanced Selection (CBS) strategy to achieve both class balance and informativeness in the chosen samples. Our CBS is plug-and-play for CIL methods based on pretrained models with prompt tuning.
arXiv Detail & Related papers (2024-12-09T16:37:27Z) - Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations.
arXiv Detail & Related papers (2024-11-27T18:50:15Z) - Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Linear Projections of Teacher Embeddings for Few-Class Distillation [14.99228980898161]
Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model.
We introduce a novel method for distilling knowledge from the teacher's model representations, which we term Learning Embedding Linear Projections (LELP).
Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrates that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems.
arXiv Detail & Related papers (2024-09-30T16:07:34Z) - Densely Distilling Cumulative Knowledge for Continual Learning [14.343655566551213]
Continual learning, involving sequential training on diverse tasks, often faces catastrophic forgetting.
We propose Dense Knowledge Distillation (DKD) to distill the cumulative knowledge of all the previous tasks.
Our DKD outperforms recent state-of-the-art baselines across diverse benchmarks and scenarios.
arXiv Detail & Related papers (2024-05-16T05:37:06Z) - PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning [30.70974942397732]
We propose PromptKD to enable generative language models to transfer student-friendly knowledge.
Experiments on instruction-following datasets show that PromptKD achieves state-of-the-art performance.
Further analysis suggests that distilling student-friendly knowledge alleviates exposure bias effectively throughout the entire training process.
arXiv Detail & Related papers (2024-02-20T09:10:08Z) - Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport [46.91791643660991]
Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments.
These models struggle in the wild because of the unavailability or degraded quality of the modalities used for training.
In practice, only a subset of the training-time modalities may be available at test time.
Learning with privileged information enables models to exploit data from additional modalities that are only available during training.
arXiv Detail & Related papers (2024-01-27T19:44:15Z) - One-bit Supervision for Image Classification: Problem, Solution, and Beyond [114.95815360508395]
This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification.
We propose a multi-stage training paradigm and incorporate negative label suppression into an off-the-shelf semi-supervised learning algorithm.
In multiple benchmarks, the learning efficiency of the proposed approach surpasses that of using full-bit, semi-supervised supervision.
arXiv Detail & Related papers (2023-11-26T07:39:00Z) - Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features.
While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance; a minimal sketch of the logit-matching objective referenced here is shown below.
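For readers unfamiliar with the logit-matching objective mentioned in this excerpt, the following is a generic sketch of temperature-scaled KL-divergence distillation. It reflects common practice rather than this specific paper's method, and the temperature value is an illustrative assumption.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard logit-matching KD loss: KL divergence between the softened
    teacher and student class distributions (temperature is illustrative)."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```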
arXiv Detail & Related papers (2023-05-26T15:05:19Z) - Large-scale Pre-trained Models are Surprisingly Strong in Incremental Novel Class Discovery [76.63807209414789]
We challenge the status quo in class-iNCD and propose a learning paradigm where class discovery occurs continuously and in a truly unsupervised manner.
We propose simple baselines, composed of a frozen PTM backbone and a learnable linear classifier, that are not only simple to implement but also resilient under longer learning scenarios.
arXiv Detail & Related papers (2023-03-28T13:47:16Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models [10.941519846908697]
We introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher.
Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with more optimal guidance.
Based on a many-to-one mapping property of the CTC algorithm, we present a training strategy that can effectively prevent the trivial solution.
arXiv Detail & Related papers (2021-11-05T14:14:05Z) - Just Label What You Need: Fine-Grained Active Selection for Perception and Prediction through Partially Labeled Scenes [78.23907801786827]
We introduce generalizations that ensure that our approach is both cost-aware and capable of fine-grained selection of examples through partially labeled scenes.
Our experiments on a real-world, large-scale self-driving dataset suggest that fine-grained selection can improve the performance across perception, prediction, and downstream planning tasks.
arXiv Detail & Related papers (2021-04-08T17:57:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.