Soft Mode in the Dynamics of Over-realizable On-line Learning for Soft Committee Machines
- URL: http://arxiv.org/abs/2104.14546v1
- Date: Thu, 29 Apr 2021 17:55:58 GMT
- Title: Soft Mode in the Dynamics of Over-realizable On-line Learning for Soft Committee Machines
- Authors: Frederieke Richert, Roman Worschech, Bernd Rosenow
- Abstract summary: Over-parametrized deep neural networks trained by gradient descent are successful in performing many tasks of practical relevance.
In the context of a student-teacher scenario, this corresponds to the so-called over-realizable case.
For on-line learning of a two-layer soft committee machine in the over-realizable case, we find that the approach to perfect learning occurs in a power-law fashion.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over-parametrized deep neural networks trained by stochastic gradient descent
are successful in performing many tasks of practical relevance. One aspect of
over-parametrization is the possibility that the student network has a larger
expressivity than the data generating process. In the context of a
student-teacher scenario, this corresponds to the so-called over-realizable
case, where the student network has a larger number of hidden units than the
teacher. For on-line learning of a two-layer soft committee machine in the
over-realizable case, we find that the approach to perfect learning occurs in a
power-law fashion rather than exponentially as in the realizable case. All
student nodes learn and replicate one of the teacher nodes if teacher and
student outputs are suitably rescaled.
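As a concrete illustration of this setup, the following is a minimal simulation sketch (not the authors' code) of on-line gradient descent for a student soft committee machine with K hidden units learning from a teacher with M < K hidden units, i.e. the over-realizable case. The erf activation, the learning-rate scaling, and all numerical values are illustrative assumptions.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)

N, M, K = 100, 2, 4        # input dimension, teacher hidden units, student hidden units (K > M: over-realizable)
eta, steps = 0.5, 100000   # learning rate and number of on-line examples (a fresh input at every step)

g = lambda h: erf(h / np.sqrt(2.0))                          # committee-machine activation
dg = lambda h: np.sqrt(2.0 / np.pi) * np.exp(-h ** 2 / 2.0)  # its derivative

B = rng.standard_normal((M, N))        # fixed teacher weights
W = 0.1 * rng.standard_normal((K, N))  # student weights, small random initialization

def output(weights, x):
    # soft committee machine: sum of hidden-unit activations, second-layer weights fixed to +1
    return g(weights @ x / np.sqrt(N)).sum()

for t in range(steps):
    x = rng.standard_normal(N)             # on-line learning: each example is seen once
    err = output(W, x) - output(B, x)      # student output minus teacher label
    h = W @ x / np.sqrt(N)
    W -= (eta / N) * err * np.outer(dg(h), x)  # SGD step on the squared error

# estimate the generalization error on fresh inputs
X = rng.standard_normal((2000, N))
eg = 0.5 * np.mean([(output(W, x) - output(B, x)) ** 2 for x in X])
print(f"estimated generalization error: {eg:.3e}")
```

Tracking the estimated generalization error during such a run is one way to see the contrast described above: a power-law decay in the over-realizable case K > M versus an exponential approach to perfect learning when K = M.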
Related papers
- Coding schemes in neural networks learning classification tasks [52.22978725954347]
We investigate fully-connected, wide neural networks learning classification tasks.
We show that the networks acquire strong, data-dependent features.
Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity.
arXiv Detail & Related papers (2024-06-24T14:50:05Z) - RdimKD: Generic Distillation Paradigm by Dimensionality Reduction [16.977144350795488]
Knowledge Distillation (KD) emerges as one of the most promising compression technologies to run advanced deep neural networks on resource-limited devices.
In this work, we propose an abstract and general paradigm for the KD task, referred to as DIMensionality Reduction KD (RdimKD).
RdimKD solely relies on dimensionality reduction, with a very minor modification to naive L2 loss.
arXiv Detail & Related papers (2023-12-14T07:34:08Z) - How a student becomes a teacher: learning and forgetting through
How a student becomes a teacher: learning and forgetting through Spectral methods [1.1470070927586018]
In theoretical ML, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition.
In this work, we take a leap forward by proposing a radically different optimization scheme.
Working in this framework, we could isolate a stable student substructure that mirrors the true complexity of the teacher.
arXiv Detail & Related papers (2023-10-19T09:40:30Z) - Online Learning for the Random Feature Model in the Student-Teacher
Framework [0.0]
We study over-parametrization in the context of a student-teacher framework.
For any finite ratio of hidden layer size and input dimension, the student cannot generalize perfectly.
Only when the student's hidden layer size is exponentially larger than the input dimension does an approach to perfect generalization become possible.
arXiv Detail & Related papers (2023-03-24T15:49:02Z)
UNIKD: UNcertainty-filtered Incremental Knowledge Distillation for Neural Implicit Representation [48.49860868061573]
Recent neural implicit representations (NIRs) have achieved great success in the tasks of 3D reconstruction and novel view synthesis.
They require the images of a scene from different camera views to be available for one-time training.
This is expensive especially for scenarios with large-scale scenes and limited data storage.
We design a student-teacher framework to mitigate the catastrophic forgetting problem.
arXiv Detail & Related papers (2022-12-21T11:43:20Z)
Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student Settings and its Superiority to Kernel Methods [58.44819696433327]
We investigate the risk of two-layer ReLU neural networks in a teacher regression model.
We find that the student network provably outperforms any kernel method.
arXiv Detail & Related papers (2022-05-30T02:51:36Z) - On Learnability via Gradient Method for Two-Layer ReLU Neural Networks
in Teacher-Student Setting [41.60125423028092]
We study two-layer ReLU networks in a teacher-student regression model.
We show that, with a specific regularization and sufficient over-parameterization, a student network can identify the parameters of the teacher network via gradient descent.
Our analysis of the global minima relies on a sparsity property in the measure space.
arXiv Detail & Related papers (2021-06-11T09:05:41Z) - All at Once Network Quantization via Collaborative Knowledge Transfer [56.95849086170461]
We develop a novel collaborative knowledge transfer approach for efficiently training the all-at-once quantization network.
Specifically, we propose an adaptive selection strategy to choose a high-precision "teacher" for transferring knowledge to the low-precision student.
To effectively transfer knowledge, we develop a dynamic block swapping method by randomly replacing the blocks in the lower-precision student network with the corresponding blocks in the higher-precision teacher network.
arXiv Detail & Related papers (2021-03-02T03:09:03Z) - Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least a 6.5× speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z) - Large-Scale Gradient-Free Deep Learning with Recursive Local
Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)