Solvable Model for Inheriting the Regularization through Knowledge
Distillation
- URL: http://arxiv.org/abs/2012.00194v2
- Date: Wed, 2 Dec 2020 16:55:14 GMT
- Title: Solvable Model for Inheriting the Regularization through Knowledge
Distillation
- Authors: Luca Saglietti and Lenka Zdeborov\'a
- Abstract summary: We introduce a statistical physics framework that allows an analytic characterization of the properties of knowledge distillation.
We show that through KD, the regularization properties of the larger teacher model can be inherited by the smaller student.
We also analyze the double descent phenomenology that can arise in the considered KD setting.
- Score: 2.944323057176686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years the empirical success of transfer learning with neural
networks has stimulated an increasing interest in obtaining a theoretical
understanding of its core properties. Knowledge distillation where a smaller
neural network is trained using the outputs of a larger neural network is a
particularly interesting case of transfer learning. In the present work, we
introduce a statistical physics framework that allows an analytic
characterization of the properties of knowledge distillation (KD) in shallow
neural networks. Focusing the analysis on a solvable model that exhibits a
non-trivial generalization gap, we investigate the effectiveness of KD. We are
able to show that, through KD, the regularization properties of the larger
teacher model can be inherited by the smaller student and that the yielded
generalization performance is closely linked to and limited by the optimality
of the teacher. Finally, we analyze the double descent phenomenology that can
arise in the considered KD setting.
Related papers
- Unraveling Feature Extraction Mechanisms in Neural Networks [10.13842157577026]
We propose a theoretical approach based on Neural Tangent Kernels (NTKs) to investigate such mechanisms.
We reveal how these models leverage statistical features during gradient descent and how they are integrated into final decisions.
We find that while self-attention and CNN models may exhibit limitations in learning n-grams, multiplication-based models seem to excel in this area.
arXiv Detail & Related papers (2023-10-25T04:22:40Z) - A Novel Neural-symbolic System under Statistical Relational Learning [50.747658038910565]
We propose a general bi-level probabilistic graphical reasoning framework called GBPGR.
In GBPGR, the results of symbolic reasoning are utilized to refine and correct the predictions made by the deep learning models.
Our approach achieves high performance and exhibits effective generalization in both transductive and inductive tasks.
arXiv Detail & Related papers (2023-09-16T09:15:37Z) - How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of ''student'' models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - What Can the Neural Tangent Kernel Tell Us About Adversarial Robustness? [0.0]
We study adversarial examples of trained neural networks through analytical tools afforded by recent theory advances connecting neural networks and kernel methods.
We show how NTKs allow to generate adversarial examples in a training-free'' fashion, and demonstrate that they transfer to fool their finite-width neural net counterparts in the lazy'' regime.
arXiv Detail & Related papers (2022-10-11T16:11:48Z) - Learning and Generalization in Overparameterized Normalizing Flows [13.074242275886977]
Normalizing flows (NFs) constitute an important class of models in unsupervised learning.
We provide theoretical and empirical evidence that for a class of NFs containing most of the existing NF models, overparametrization hurts training.
We prove that unconstrained NFs can efficiently learn any reasonable data distribution under minimal assumptions when the underlying network is overparametrized.
arXiv Detail & Related papers (2021-06-19T17:11:42Z) - Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical representation of tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - Geometry Perspective Of Estimating Learning Capability Of Neural
Networks [0.0]
The paper considers a broad class of neural networks with generalized architecture performing simple least square regression with gradient descent (SGD)
The relationship between the generalization capability with the stability of the neural network has also been discussed.
By correlating the principles of high-energy physics with the learning theory of neural networks, the paper establishes a variant of the Complexity-Action conjecture from an artificial neural network perspective.
arXiv Detail & Related papers (2020-11-03T12:03:19Z) - Deep Knowledge Tracing with Learning Curves [0.9088303226909278]
We propose a Convolution-Augmented Knowledge Tracing (CAKT) model in this paper.
The model employs three-dimensional convolutional neural networks to explicitly learn a student's recent experience on applying the same knowledge concept with that in the next question.
CAKT achieves the new state-of-the-art performance in predicting students' responses compared with existing models.
arXiv Detail & Related papers (2020-07-26T15:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.