Learning Curves for Sequential Training of Neural Networks:
Self-Knowledge Transfer and Forgetting
- URL: http://arxiv.org/abs/2112.01653v1
- Date: Fri, 3 Dec 2021 00:25:01 GMT
- Title: Learning Curves for Sequential Training of Neural Networks:
Self-Knowledge Transfer and Forgetting
- Authors: Ryo Karakida and Shotaro Akaho
- Abstract summary: We consider neural networks in the neural tangent kernel regime that continually learn target functions from task to task.
We investigate a variant of continual learning where the model learns the same target function in multiple tasks.
Even for the same target, the trained model shows some transfer and forgetting depending on the sample size of each task.
- Score: 9.734033555407406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequential training from task to task is becoming a central topic in
deep learning applications such as continual learning and transfer learning.
Nevertheless, it remains unclear under what conditions the trained model's
performance improves or deteriorates. To deepen our understanding of sequential
training, this study provides a theoretical analysis of generalization
performance in a solvable case of continual learning. We consider neural
networks in the neural tangent kernel (NTK) regime that continually learn
target functions from task to task, and investigate the generalization by using
an established statistical mechanical analysis of kernel ridge-less regression.
We first show characteristic transitions from positive to negative transfer:
targets whose similarity exceeds a specific critical value achieve positive
knowledge transfer to the subsequent task, while catastrophic forgetting can
occur even for very similar targets. Next, we investigate a variant of continual
learning where the model learns the same target function in multiple tasks.
Even for the same target, the trained model shows some transfer and forgetting
depending on the sample size of each task. We can guarantee that the
generalization error monotonically decreases from task to task for equal sample
sizes, while unbalanced sample sizes deteriorate generalization. We refer to
this improvement and deterioration as self-knowledge transfer and forgetting,
respectively, and empirically confirm them in realistic training of
deep neural networks as well.
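Below is a minimal numerical sketch of the sequential kernel ridge-less regression setup the abstract describes: the same target function is learned task after task, each task contributing fresh samples, and the generalization error is tracked after each task. The RBF kernel (a stand-in for the NTK), the sine target, the input dimension, and the sample-size schedule are illustrative assumptions, not the paper's actual statistical-mechanical analysis.

```python
# A minimal sketch of sequential kernel "ridge-less" regression in an
# NTK-style lazy regime. Kernel, target, dimensions, and sample sizes
# are illustrative assumptions, not the paper's analysis.
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, gamma=1.0):
    """Fixed kernel playing the role of the neural tangent kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def target(x):
    """The same target function is learned in every task."""
    return np.sin(3 * x).sum(axis=1)

def train_on_task(X, residual, predict_fn, jitter=1e-8):
    """Ridge-less (interpolating) fit of the current residual on task data.

    Returns f_new = f_old + k(., X) K^{-1} (y - f_old(X)), which is how
    lazy/NTK-regime training on a new task updates the learned function.
    """
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))
    alpha = np.linalg.solve(K, residual)
    return lambda Z: predict_fn(Z) + rbf_kernel(Z, X) @ alpha

f = lambda Z: np.zeros(len(Z))          # start from the zero function
X_test = rng.uniform(-1, 1, size=(2000, 5))
y_test = target(X_test)

# Equal sample sizes per task; try e.g. [100, 100, 10] for an unbalanced schedule.
for task, n in enumerate([100, 100, 100], start=1):
    X = rng.uniform(-1, 1, size=(n, 5))
    y = target(X)
    f = train_on_task(X, y - f(X), f)   # train only on the new task's samples
    err = np.mean((f(X_test) - y_test) ** 2)
    print(f"task {task} (n={n}): test MSE = {err:.4f}")
```

Comparing an equal-size schedule with an unbalanced one in this toy setting gives a concrete handle on the self-knowledge transfer and forgetting regimes the abstract describes.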
Related papers
- ForkMerge: Mitigating Negative Transfer in Auxiliary-Task Learning [59.08197876733052]
Auxiliary-Task Learning (ATL) aims to improve the performance of the target task by leveraging the knowledge obtained from related tasks.
Sometimes, learning multiple tasks simultaneously results in lower accuracy than learning only the target task, known as negative transfer.
ForkMerge is a novel approach that periodically forks the model into multiple branches and automatically searches for suitable task weights (a rough sketch of the fork-and-merge idea appears after the related papers below).
arXiv Detail & Related papers (2023-01-30T02:27:02Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - How does unlabeled data improve generalization in self-training? A
one-hidden-layer theoretical analysis [93.37576644429578]
This work establishes the first theoretical analysis for the known iterative self-training paradigm.
We prove the benefits of unlabeled data in both training convergence and generalization ability.
Experiments ranging from shallow to deep neural networks are also provided to support the established theoretical insights on self-training.
arXiv Detail & Related papers (2022-01-21T02:16:52Z) - On the Convergence of Shallow Neural Network Training with Randomly
Masked Neurons [11.119895959906085]
Given a dense shallow neural network, we focus on creating, training, and combining randomly selected functions.
By analyzing $i)$ the subnetworks' neural tangent kernel, $ii)$ the surrogate functions' gradient, and $iii)$ how we sample and combine the surrogate functions, we prove a linear convergence rate of the training error.
For fixed neuron selection probability, the error term decreases as we increase the number of surrogate models, and increases as we increase the number of local training steps.
arXiv Detail & Related papers (2021-12-05T19:51:14Z) - What training reveals about neural network complexity [80.87515604428346]
This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization.
arXiv Detail & Related papers (2021-06-08T08:58:00Z) - A study on the plasticity of neural networks [21.43675319928863]
We discuss the implications of losing plasticity for continual learning.
We show that a model pretrained on data from the same distribution as the one it is fine-tuned on might not reach the same generalisation as a freshly initialised model.
arXiv Detail & Related papers (2021-05-31T18:21:06Z) - Unsupervised Transfer Learning for Spatiotemporal Predictive Networks [90.67309545798224]
We study how to transfer knowledge from a zoo of unsupervisedly learned models towards another network.
Our motivation is that models are expected to understand complex dynamics from different sources.
Our approach yields significant improvements on three benchmarks for spatiotemporal prediction, and benefits the target even from less relevant models.
arXiv Detail & Related papers (2020-09-24T15:40:55Z) - Automatic Recall Machines: Internal Replay, Continual Learning and the
Brain [104.38824285741248]
Replay in neural networks involves training on sequential data with memorized samples, which counteracts forgetting of previous behavior caused by non-stationarity.
We present a method where these auxiliary samples are generated on the fly, given only the model that is being trained for the assessed objective.
Instead, the implicit memory of learned samples within the assessed model itself is exploited.
arXiv Detail & Related papers (2020-06-22T15:07:06Z) - Minimax Lower Bounds for Transfer Learning with Linear and One-hidden
Layer Neural Networks [27.44348371795822]
We develop a statistical minimax framework to characterize the limits of transfer learning.
We derive a lower-bound for the target generalization error achievable by any algorithm as a function of the number of labeled source and target data.
arXiv Detail & Related papers (2020-06-16T22:49:26Z) - Inter- and Intra-domain Knowledge Transfer for Related Tasks in Deep
Character Recognition [2.320417845168326]
Pre-training a deep neural network on the ImageNet dataset is a common practice for training deep learning models.
The technique of pre-training on one task and then retraining on a new one is called transfer learning.
In this paper we analyse the effectiveness of using deep transfer learning for character recognition tasks.
arXiv Detail & Related papers (2020-01-02T14:18:25Z)
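As referenced in the ForkMerge entry above, the fork-and-merge idea can be sketched as follows: the shared weights are periodically forked into branches that mix the target-task and auxiliary-task gradients with different weights, and the branch whose weighting helps the target task most on held-out data is merged back. The toy linear model, the candidate weights, the plain gradient steps, and the merge rule below are illustrative assumptions rather than the authors' exact algorithm.

```python
# A rough sketch of the fork-and-merge idea summarized in the ForkMerge entry
# above. Model, task weights, update rule, and merge rule are illustrative
# assumptions, not the authors' exact algorithm.
import numpy as np

rng = np.random.default_rng(1)
d = 20
w_true = rng.normal(size=d)

def make_task(n, noise):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + noise * rng.normal(size=n)

X_tgt, y_tgt = make_task(50, 0.1)       # small target-task training set
X_val, y_val = make_task(200, 0.1)      # target-task validation set
X_aux, y_aux = make_task(500, 2.0)      # related but much noisier auxiliary task

def grad(w, X, y):
    """Gradient of the mean-squared error of a linear model."""
    return 2 * X.T @ (X @ w - y) / len(y)

def val_loss(w):
    return np.mean((X_val @ w - y_val) ** 2)

w = np.zeros(d)
lambdas = [0.0, 0.25, 0.5, 1.0]         # candidate auxiliary-task weights
for epoch in range(20):
    # Fork: each branch continues from the shared weights with its own
    # mixture of target-task and auxiliary-task gradients.
    branches = []
    for lam in lambdas:
        wb = w.copy()
        for _ in range(10):
            wb -= 0.01 * (grad(wb, X_tgt, y_tgt) + lam * grad(wb, X_aux, y_aux))
        branches.append(wb)
    # Merge: move toward the branch whose weighting helps the target task most,
    # so a harmful auxiliary weighting (negative transfer) is simply discarded.
    w = 0.5 * w + 0.5 * min(branches, key=val_loss)

print("target validation MSE:", val_loss(w))
```

Because every fork is judged only by target-task validation loss, branches whose auxiliary weighting hurts the target never survive the merge, which is the sense in which such a scheme mitigates negative transfer.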
This list is automatically generated from the titles and abstracts of the papers in this site.