How a student becomes a teacher: learning and forgetting through
Spectral methods
- URL: http://arxiv.org/abs/2310.12612v2
- Date: Fri, 3 Nov 2023 14:53:57 GMT
- Title: How a student becomes a teacher: learning and forgetting through
Spectral methods
- Authors: Lorenzo Giambagli, Lorenzo Buffoni, Lorenzo Chicchi, Duccio Fanelli
- Abstract summary: In theoretical ML, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition.
In this work, we take a leap forward by proposing a radically different optimization scheme.
Working in this framework, we can isolate a stable student substructure which mirrors the true complexity of the teacher.
- Score: 1.1470070927586018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In theoretical ML, the teacher-student paradigm is often employed as an
effective metaphor for real-life tuition. The above scheme proves particularly
relevant when the student network is overparameterized as compared to the
teacher network. Under these operating conditions, it is tempting to speculate
that the student's ability to handle the given task could eventually be stored
in a sub-portion of the whole network. This sub-portion should be, according to
suitable metrics, to some extent reminiscent of the frozen teacher structure,
while remaining approximately invariant across different architectures of the
candidate student network. Unfortunately, state-of-the-art conventional
learning techniques cannot identify such an invariant subnetwork, owing to the
inherent non-convexity of the examined problem. In this work, we take a leap
forward by
proposing a radically different optimization scheme which builds on a spectral
representation of the linear transfer of information between layers. The
gradient is hence calculated with respect to both eigenvalues and eigenvectors,
with a negligible increase in computational cost compared to standard training
algorithms. Working in this framework, we can isolate a stable student
substructure which mirrors the true complexity of the teacher in terms of
computing neurons, path distribution and topological attributes. When
unimportant nodes of the trained student are pruned, following a ranking that
reflects the optimized eigenvalues, no degradation in the recorded performance
is observed as long as the retained size stays above a threshold that
corresponds to the effective teacher size. The observed behavior can be
pictured as a genuine second-order phase
transition that bears universality traits.
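To make the spectral scheme more tangible, here is a minimal PyTorch sketch of a layer whose weight matrix is assembled from trainable eigenvalues and eigenvector entries, so that the gradient reaches both sets of spectral parameters. The specific parametrization W_ij = (lam_in_j - lam_out_i) * phi_ij, the class name SpectralLinear and the helper rank_nodes_by_eigenvalue are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpectralLinear(nn.Module):
    """Linear layer parametrized in the spectral domain (illustrative sketch).

    The weights are assembled from trainable eigenvalues attached to the input
    and output nodes and from trainable eigenvector entries phi, here taken as
    W_ij = (lam_in_j - lam_out_i) * phi_ij. This is an assumed reconstruction
    of the spectral-learning recipe, not the authors' code.
    """

    def __init__(self, n_in: int, n_out: int):
        super().__init__()
        self.lam_in = nn.Parameter(torch.randn(n_in))             # eigenvalues of input nodes
        self.lam_out = nn.Parameter(torch.randn(n_out))           # eigenvalues of output nodes
        self.phi = nn.Parameter(0.01 * torch.randn(n_out, n_in))  # eigenvector block

    def weight(self) -> torch.Tensor:
        # Gradients flow to both eigenvalues and eigenvectors through this product.
        return (self.lam_in.unsqueeze(0) - self.lam_out.unsqueeze(1)) * self.phi

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight().t()


def rank_nodes_by_eigenvalue(layer: SpectralLinear, keep: int) -> torch.Tensor:
    """Hypothetical helper: rank the input nodes by |eigenvalue| and return the
    indices of the `keep` most important ones, mimicking the eigenvalue-based
    pruning described in the abstract."""
    return layer.lam_in.abs().argsort(descending=True)[:keep]
```

Under these assumptions, pruning amounts to discarding the input nodes with the smallest |eigenvalue| (zeroing their eigenvalues and eigenvector columns) and tracking how performance behaves as the retained size approaches the effective teacher size.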
Related papers
- What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks.
This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z)
- Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales [54.78115855552886]
We show how to construct over-complete invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture.
With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner.
For robust and interpretable vision tasks at larger scales, hierarchical invariant representations can be considered an effective alternative to traditional CNNs and invariants.
arXiv Detail & Related papers (2024-02-23T16:50:07Z)
- The Copycat Perceptron: Smashing Barriers Through Collective Learning [3.55026004901472]
We analyze a general setting in which thermal noise is present and affects each student's generalization performance.
We find that the coupling of replicas leads to a bend of the phase diagram towards smaller values of $\alpha$.
These results provide additional analytic and numerical evidence for the recently conjectured Bayes-optimal property of Replicated Simulated Annealing.
arXiv Detail & Related papers (2023-08-07T17:51:09Z)
- How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
- Isometric Representations in Neural Networks Improve Robustness [0.0]
We train neural networks to perform classification while simultaneously maintaining within-class metric structure.
We verify that isometric regularization improves the robustness to adversarial attacks on MNIST.
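As a rough illustration of that idea, the sketch below penalizes the mismatch between within-class pairwise distances measured in input space and in feature space; the function name isometry_penalty and the squared-difference penalty are assumptions of this sketch, not the paper's exact regularizer.

```python
import torch

def isometry_penalty(inputs: torch.Tensor,
                     features: torch.Tensor,
                     labels: torch.Tensor) -> torch.Tensor:
    """Assumed sketch of an isometric regularizer: keep within-class pairwise
    distances in feature space close to those in input space."""
    penalty = features.new_zeros(())
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if idx.numel() < 2:
            continue  # need at least a pair to define a distance
        d_in = torch.pdist(inputs[idx].flatten(1))   # within-class input distances
        d_feat = torch.pdist(features[idx])          # within-class feature distances
        penalty = penalty + ((d_feat - d_in) ** 2).mean()
    return penalty

# The penalty would be added to the usual classification loss with some weight,
# e.g. loss = ce_loss + 0.1 * isometry_penalty(x, z, y).
```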
arXiv Detail & Related papers (2022-11-02T16:18:18Z)
- Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
- On the training of sparse and dense deep neural networks: less parameters, same performance [0.0]
We propose a variant of the spectral learning method that appeared in Giambagli et al., Nat. Commun. 2021.
The eigenvalues act as veritable knobs which can be freely tuned so as to enhance, or alternatively silence, the contribution of the input nodes.
Each spectral parameter reflects back on the whole set of inter-node weights, an attribute which we effectively exploit to yield sparse networks with stunning classification abilities.
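A toy illustration of that knob behaviour, under the assumed simplified parametrization W[:, j] = lam[j] * phi[:, j] with one eigenvalue per input node: setting a node's eigenvalue to zero silences every connection leaving that node at once, which is what makes eigenvalue-ranked sparsification possible.

```python
import torch

# Assumed toy parametrization: each column of the weight matrix is scaled by
# the eigenvalue of the corresponding input node.
n_in, n_out = 6, 4
lam = torch.randn(n_in)              # one spectral "knob" per input node
phi = torch.randn(n_out, n_in)       # eigenvector entries

W_dense = lam.unsqueeze(0) * phi     # effective weights before sparsification

# Silence the three weakest knobs: their whole columns vanish simultaneously.
lam_sparse = lam.clone()
lam_sparse[lam.abs().argsort()[:3]] = 0.0
W_sparse = lam_sparse.unsqueeze(0) * phi

print((W_sparse == 0).all(dim=0))    # True exactly for the silenced input nodes
```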
arXiv Detail & Related papers (2021-06-17T14:54:23Z)
- Graph Consistency based Mean-Teaching for Unsupervised Domain Adaptive Person Re-Identification [54.58165777717885]
This paper proposes a Graph Consistency based Mean-Teaching (GCMT) method that constructs a Graph Consistency Constraint (GCC) between teacher and student networks.
Experiments on three datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, show that the proposed GCMT outperforms state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2021-05-11T04:09:49Z)
- Soft Mode in the Dynamics of Over-realizable On-line Learning for Soft Committee Machines [0.0]
Over-parametrized deep neural networks trained by gradient descent are successful in performing many tasks of practical relevance.
In the context of a student-teacher scenario, this corresponds to the so-called over-realizable case.
For on-line learning of a two-layer soft committee machine in the over-realizable case, we find that the approach to perfect learning occurs in a power-law fashion.
arXiv Detail & Related papers (2021-04-29T17:55:58Z)
- Representation Transfer by Optimal Transport [34.77292648424614]
We use optimal transport to quantify the match between two representations.
This distance defines a regularizer promoting the similarity of the student's representation with that of the teacher.
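A minimal differentiable sketch of such a regularizer, built on a plain Sinkhorn iteration with uniform weights; the entropic regularization strength, the squared-Euclidean cost and the fixed iteration count are assumptions of this illustration, not the paper's exact estimator.

```python
import torch

def sinkhorn_cost(x: torch.Tensor, y: torch.Tensor,
                  eps: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    """Entropy-regularized OT cost between student features x (n, d) and
    teacher features y (m, d), with uniform weights. Illustrative sketch."""
    n, m = x.shape[0], y.shape[0]
    C = torch.cdist(x, y, p=2) ** 2                  # pairwise squared distances
    K = torch.exp(-C / eps)                          # Gibbs kernel
    a = torch.full((n,), 1.0 / n, device=x.device)
    b = torch.full((m,), 1.0 / m, device=x.device)
    u = torch.ones_like(a)
    for _ in range(n_iters):                         # Sinkhorn fixed-point updates
        v = b / (K.t() @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    P = u.unsqueeze(1) * K * v.unsqueeze(0)          # transport plan
    return (P * C).sum()                             # <P, C>, used as a regularizer
```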
arXiv Detail & Related papers (2020-07-13T23:42:06Z)
- Eigendecomposition-Free Training of Deep Networks for Linear Least-Square Problems [107.3868459697569]
We introduce an eigendecomposition-free approach to training a deep network.
We show that our approach is much more robust than explicit differentiation of the eigendecomposition.
Our method has better convergence properties and yields state-of-the-art results.
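A hedged sketch of how an eigendecomposition-free objective can look for homogeneous least-squares problems: when the ground-truth solution e_gt of min_{||e||=1} ||A e|| is available at training time, one can penalize its residual under the network-predicted matrix A directly, instead of extracting and differentiating the smallest eigenvector of A^T A. The function below is an illustrative reconstruction, not the paper's exact loss.

```python
import torch

def eig_free_loss(A: torch.Tensor, e_gt: torch.Tensor) -> torch.Tensor:
    """Illustrative eigendecomposition-free surrogate: if e_gt solves
    min_{||e||=1} ||A e||, then the residual ||A e_gt||^2 of the
    network-predicted matrix A should be driven to zero."""
    e = e_gt / e_gt.norm()        # keep the unit-norm constraint explicit
    residual = A @ e              # residual of the homogeneous linear system
    return (residual ** 2).sum()  # differentiable without any eigen-solver
```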
arXiv Detail & Related papers (2020-04-15T04:29:34Z)