Reminding the Incremental Language Model via Data-Free Self-Distillation
- URL: http://arxiv.org/abs/2110.08745v1
- Date: Sun, 17 Oct 2021 07:27:43 GMT
- Title: Reminding the Incremental Language Model via Data-Free Self-Distillation
- Authors: Han Wang, Ruiliu Fu, Chengzhang Li, Xuejun Zhang, Jun Zhou, Yonghong Yan
- Abstract summary: Incremental language learning with pseudo-data can alleviate catastrophic forgetting in neural networks.
We propose reminding the incremental language model via data-free self-distillation (DFSD).
Our DFSD exceeds the previous state-of-the-art methods even when the pseudo-data is reduced by up to 90%.
- Score: 26.960750314663294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incremental language learning with pseudo-data can alleviate catastrophic
forgetting in neural networks. However, to obtain better performance, previous methods
demand large amounts of pseudo-data for earlier tasks, and performance drops sharply
when less pseudo-data is employed. In addition, the distribution of the pseudo-data
gradually deviates from that of the real data as tasks are learned sequentially; the
more tasks are learned, the larger the deviation and the more severe the catastrophic
forgetting. To address these issues, we propose reminding the incremental language model
via data-free self-distillation (DFSD), which comprises self-distillation based on the
Earth Mover's Distance and hidden data augmentation. By estimating the knowledge
distribution in all layers of GPT-2 and transferring it from the teacher model to the
student model, self-distillation based on the Earth Mover's Distance significantly
reduces the demand for pseudo-data. Hidden data augmentation greatly alleviates the
catastrophic forgetting caused by this deviation by modeling the generation of
pseudo-data as a hidden data augmentation process in which each sample is a mixture of
all trained task data. The experimental results demonstrate that DFSD exceeds the
previous state-of-the-art methods even when the pseudo-data is reduced by up to 90%.
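The abstract does not give implementation details, so the following is only a minimal, hypothetical sketch of what layer-wise self-distillation weighted by an Earth Mover's Distance style transport plan could look like for GPT-2 hidden states. Exact EMD is approximated with Sinkhorn iterations, and the pooling, squared-Euclidean layer cost, uniform layer weights, and function names (`sinkhorn_plan`, `emd_distill_loss`) are all assumptions for illustration, not the authors' method.

```python
# Sketch (not the authors' code): distill a teacher's per-layer knowledge into a
# student by weighting pairwise layer costs with an approximate EMD transport plan.
import torch


def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, iters: int = 200) -> torch.Tensor:
    """Entropic-OT transport plan between uniform teacher/student layer distributions."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)            # teacher layer weights (assumed uniform)
    b = torch.full((m,), 1.0 / m)            # student layer weights (assumed uniform)
    K = torch.exp(-cost / eps)
    u = torch.ones(n)
    for _ in range(iters):                   # Sinkhorn iterations approximate exact EMD
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # n x m transport plan


def emd_distill_loss(teacher_hiddens, student_hiddens):
    """Each argument: list of [batch, seq, dim] hidden states, e.g. per-layer outputs
    of a frozen (teacher) and trainable (student) GPT-2 on the same pseudo-data batch."""
    # Mean-pool each layer so the layer-cost matrix is cheap to compute (an assumption).
    t = torch.stack([h.mean(dim=(0, 1)) for h in teacher_hiddens])   # [n_teacher, dim]
    s = torch.stack([h.mean(dim=(0, 1)) for h in student_hiddens])   # [n_student, dim]
    cost = torch.cdist(t, s, p=2) ** 2                               # pairwise layer cost
    plan = sinkhorn_plan(cost.detach())                              # fix plan, train layers
    return (plan * cost).sum()                                       # transport-weighted loss


if __name__ == "__main__":
    # Toy shapes standing in for 12 GPT-2 layers, batch 2, sequence 8, hidden size 16.
    teacher = [torch.randn(2, 8, 16) for _ in range(12)]
    student = [torch.randn(2, 8, 16, requires_grad=True) for _ in range(12)]
    loss = emd_distill_loss(teacher, student)
    loss.backward()
    print(float(loss))
```

In this sketch the transport plan decides how much each student layer should learn from each teacher layer, so knowledge from a layer can be spread across several layers rather than matched one-to-one; the hidden data augmentation component described in the abstract is not covered here.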
Related papers
- Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization [30.738229850748137]
MolPeg is a Molecular data Pruning framework for enhanced Generalization.
It focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models.
It consistently outperforms existing DP methods across four downstream tasks.
arXiv Detail & Related papers (2024-09-02T09:06:04Z)
- Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI).
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
Based on the theoretical analysis, we propose a novel data extraction method called Surrogate condItional Data Extraction (SIDE) that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z)
- Mendata: A Framework to Purify Manipulated Training Data [12.406255198638064]
We propose Mendata, a framework to purify manipulated training data.
Mendata perturbs the training inputs so that they retain their utility but are distributed similarly to the reference data.
We demonstrate the effectiveness of Mendata by applying it to defeat state-of-the-art data poisoning and data tracing techniques.
arXiv Detail & Related papers (2023-12-03T04:40:08Z)
- Farzi Data: Autoregressive Data Distillation [34.39112473620335]
We study data distillation for auto-regressive machine learning tasks.
We propose Farzi, which summarizes an event sequence dataset into a small number of synthetic sequences.
arXiv Detail & Related papers (2023-10-15T23:23:27Z)
- A Pre-trained Data Deduplication Model based on Active Learning [13.495903601474819]
"dirty data" problems can significantly limit the effective application of big data.
We propose a pre-trained deduplication model based on active learning.
Our proposed model outperforms previous state-of-the-art (SOTA) for deduplicated data identification.
arXiv Detail & Related papers (2023-07-31T03:56:46Z)
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset of synthetic samples such that models trained on it achieve performance comparable to models trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
- On-the-fly Denoising for Data Augmentation in Natural Language Understanding [101.46848743193358]
We propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.
Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.
arXiv Detail & Related papers (2022-12-20T18:58:33Z)
- Scaling Laws and Interpretability of Learning from Repeated Data [4.3242395495523525]
We train a family of models where most of the data is unique but a small fraction of it is repeated many times.
We find a strong double descent phenomenon, in which repeated data can cause the test loss to increase midway through training.
A predictable range of repetition frequency leads to surprisingly severe degradation in performance.
arXiv Detail & Related papers (2022-05-21T02:14:27Z)
- Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations [76.82124752950148]
We develop a convenient gradient-based method for selecting the data augmentation.
We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective.
arXiv Detail & Related papers (2022-02-22T02:51:11Z)
- X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To combine the strengths of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)