Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks
- URL: http://arxiv.org/abs/2410.02116v3
- Date: Mon, 31 Mar 2025 19:01:30 GMT
- Title: Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks
- Authors: Siddharth Joshi, Jiayi Ni, Baharan Mirzasoleiman
- Abstract summary: We propose the first effective DD method for SSL pre-training. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders.
- Score: 10.932880269282014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset distillation (DD) generates small synthetic datasets that can efficiently train deep networks with a limited amount of memory and compute. Despite the success of DD methods for supervised learning, DD for self-supervised pre-training of deep models has remained unaddressed. Pre-training on unlabeled data is crucial for efficiently generalizing to downstream tasks with limited labeled data. In this work, we propose the first effective DD method for SSL pre-training. First, we show, theoretically and empirically, that naive application of supervised DD methods to SSL fails, due to the high variance of the SSL gradient. Then, we address this issue by relying on insights from knowledge distillation (KD) literature. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. Then, we generate a small synthetic dataset by matching the training trajectories of the student models. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders. Through extensive experiments, we show that our distilled sets lead to up to 13% higher accuracy than prior work, on a variety of downstream tasks, in the presence of limited labeled data. Code at https://github.com/BigML-CS-UCLA/MKDT.
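As a rough illustration of the two-stage recipe the abstract describes (record knowledge-distillation "expert" trajectories on real data, then optimize a synthetic set by matching those trajectories), here is a minimal PyTorch sketch. It is not the authors' implementation (see https://github.com/BigML-CS-UCLA/MKDT for that); the toy encoders, tensor shapes, hyperparameters, and helper names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call


def make_encoder(width: int) -> torch.nn.Module:
    # Toy stand-in for an image encoder (a ResNet in the real setting).
    return torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(3 * 32 * 32, width),
        torch.nn.ReLU(),
        torch.nn.Linear(width, 128),
    )


teacher = make_encoder(512)   # assumed pre-trained with SSL (e.g. SimCLR/BYOL), kept frozen
student = make_encoder(256)   # smaller student trained with the KD objective below


# Stage 1: the low-variance KD objective used to record "expert" trajectories on real data.
def kd_loss(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        targets = teacher(images)              # frozen teacher representations
    return F.mse_loss(model(images), targets)  # regression target, not a contrastive SSL loss
# (In practice: train several students on real data with this loss and save checkpoints.)


# Stage 2: distill synthetic images by matching the recorded trajectories.
def trajectory_match_loss(syn_images, start_params, expert_params, syn_lr, inner_steps):
    """Unroll a few KD steps on the synthetic set from a recorded student checkpoint,
    then measure how far we end up from the checkpoint the real-data run reached."""
    params = {k: v.clone().requires_grad_(True) for k, v in start_params.items()}
    for _ in range(inner_steps):
        out = functional_call(student, params, (syn_images,))
        with torch.no_grad():
            tgt = teacher(syn_images)
        inner = F.mse_loss(out, tgt)
        grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
        params = {k: v - syn_lr * g for (k, v), g in zip(params.items(), grads)}
    return sum(F.mse_loss(params[k], expert_params[k]) for k in params)


# Usage with random tensors standing in for real checkpoints and images:
syn_images = torch.randn(64, 3, 32, 32, requires_grad=True)
start = {k: v.detach() for k, v in student.named_parameters()}
expert = {k: v.detach() + 0.01 * torch.randn_like(v) for k, v in student.named_parameters()}
loss = trajectory_match_loss(syn_images, start, expert, syn_lr=0.01, inner_steps=3)
loss.backward()   # gradients flow back into syn_images; an outer optimizer would update them
```

The design point the abstract emphasizes is that the inner objective is a low-variance regression onto frozen teacher representations rather than a high-variance SSL loss, which is what keeps the unrolled trajectory matching stable enough to produce useful synthetic data.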
Related papers
- FATE: A Prompt-Tuning-Based Semi-Supervised Learning Framework for Extremely Limited Labeled Data [36.21759320898034]
Semi-supervised learning (SSL) has achieved significant progress by leveraging both labeled data and unlabeled data.
We propose Firstly Adapt, Then catEgorize (FATE), a novel SSL framework tailored for scenarios with extremely limited labeled data.
FATE exploits unlabeled data to compensate for scarce supervision signals, then transfers to downstream tasks.
arXiv Detail & Related papers (2025-04-14T02:54:28Z) - Transferable text data distillation by trajectory matching [27.826518926355295]
Data distillation aims to synthesize a small number of data samples that achieve the training effect of the full dataset.
In this work, we propose a method that learns pseudo prompt data based on trajectory matching.
Evaluations on two benchmarks, including ARC-Easy and MMLU instruction tuning datasets, established the superiority of our distillation approach over the SOTA data selection method LESS.
arXiv Detail & Related papers (2025-04-14T02:39:26Z) - Small Scale Data-Free Knowledge Distillation [37.708282211941416]
We propose Small Scale Data-free Knowledge Distillation (SSD-KD).
SSD-KD balances synthetic samples and uses a priority sampling function to select suitable samples.
It can perform distillation training conditioned on an extremely small number of synthetic samples.
arXiv Detail & Related papers (2024-06-12T05:09:41Z) - On Pretraining Data Diversity for Self-Supervised Learning [57.91495006862553]
We explore the impact of training with more diverse datasets on the performance of self-supervised learning (SSL) under a fixed computational budget.
Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal.
arXiv Detail & Related papers (2024-03-20T17:59:58Z) - Self-supervised learning for skin cancer diagnosis with limited training data [0.196629787330046]
Self-supervised learning (SSL) is an alternative to the standard supervised pre-training on ImageNet for scenarios with limited training data.
We consider further SSL pre-training on task-specific datasets, where our implementation is motivated by supervised transfer learning.
We find minimal further SSL pre-training on task-specific data can be as effective as large-scale SSL pre-training on ImageNet for medical image classification tasks with limited labelled data.
arXiv Detail & Related papers (2024-01-01T08:11:38Z) - Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset of synthetic samples, such that models trained on it yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its applications.
arXiv Detail & Related papers (2023-01-17T17:03:28Z) - Self-omics: A Self-supervised Learning Framework for Multi-omics Cancer
Data [4.843654097048771]
Self-Supervised Learning (SSL) methods are typically used to deal with limited labelled data.
We develop a novel pre-training paradigm that consists of various SSL components.
Our approach outperforms the state-of-the-art method in cancer type classification on the TCGA pan-cancer dataset.
arXiv Detail & Related papers (2022-10-03T11:20:12Z) - Task-Customized Self-Supervised Pre-training with Scalable Dynamic
Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degenerate the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z) - Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as that of the labeled ones.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data would be prioritized.
arXiv Detail & Related papers (2022-05-02T16:09:17Z) - Dataset Distillation by Matching Training Trajectories [75.9031209877651]
We propose a new formulation that optimizes our distilled data to guide networks to a state similar to that reached by training on real data.
Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data.
Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.
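For reference, the objective in this line of trajectory-matching work is commonly written as a normalized parameter distance; the rendering below is a paraphrase with assumed notation ($\theta^{*}$ for expert checkpoints trained on real data, $\hat{\theta}$ for parameters trained on the distilled data), not a quote from the paper:

```latex
\mathcal{L}_{\text{match}}
  = \frac{\lVert \hat{\theta}_{t+N} - \theta^{*}_{t+M} \rVert_2^{2}}
         {\lVert \theta^{*}_{t} - \theta^{*}_{t+M} \rVert_2^{2}}
```

Here $\hat{\theta}_{t+N}$ is obtained by $N$ training steps on the distilled data starting from the expert checkpoint $\theta^{*}_{t}$, and $\theta^{*}_{t+M}$ is the expert checkpoint reached after $M$ further steps on real data; the denominator normalizes the distance so the loss is comparable across points along the trajectory.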
arXiv Detail & Related papers (2022-03-22T17:58:59Z) - DATA: Domain-Aware and Task-Aware Pre-training [94.62676913928831]
We present DATA, a simple yet effective NAS approach specialized for self-supervised learning (SSL).
Our method achieves promising results across a wide range of computation costs on downstream tasks, including image classification, object detection and semantic segmentation.
arXiv Detail & Related papers (2022-03-17T02:38:49Z) - Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To keep training on the enlarged dataset tractable, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.