A New Perspective for Understanding Generalization Gap of Deep Neural
Networks Trained with Large Batch Sizes
- URL: http://arxiv.org/abs/2210.12184v1
- Date: Fri, 21 Oct 2022 18:23:12 GMT
- Title: A New Perspective for Understanding Generalization Gap of Deep Neural
Networks Trained with Large Batch Sizes
- Authors: Oyebade K. Oyedotun and Konstantinos Papadopoulos and Djamila Aouada
- Abstract summary: Deep neural networks (DNNs) are typically optimized using various forms of the mini-batch gradient descent algorithm.
Many works report a progressive loss of model generalization when the training batch size is increased beyond some limit.
This scenario is commonly referred to as the generalization gap.
Our analysis suggests that a large training batch size results in increased near-rank loss of units' activation (i.e. output) tensors.
- Score: 14.822603738271138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks (DNNs) are typically optimized using various
forms of the mini-batch gradient descent algorithm. A major motivation for
mini-batch gradient descent is that, with a suitably chosen batch size,
available computing resources (including parallelization) can be utilized
optimally for fast model training. However, many works report a progressive
loss of model generalization when the training batch size is increased beyond
some limit, a scenario commonly referred to as the generalization gap. Although
several works have proposed different methods for alleviating the
generalization gap problem, a unanimous account of the generalization gap is
still lacking in the literature. This is especially important given that recent
works have observed that several proposed remedies, such as learning rate
scaling and an increased training budget, do not in fact resolve it. The main
aim of this paper is therefore to investigate and provide new perspectives on
the source of generalization loss for DNNs trained with a large batch size. Our
analysis suggests that a large training batch size results in increased
near-rank loss of units' activation (i.e. output) tensors, which in turn
impairs model optimization and generalization. Extensive validation experiments
are performed on popular DNN models, namely VGG-16, a residual network
(ResNet-56) and LeNet-5, using the CIFAR-10, CIFAR-100, Fashion-MNIST and MNIST
datasets.
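As an illustration of the kind of diagnostic the abstract describes, the sketch below trains a small multilayer perceptron on synthetic data with a small and a large batch size, then measures how nearly rank-deficient a hidden layer's activation matrix is via the fraction of its singular values that fall below a threshold relative to the largest one. This is a minimal sketch under assumed choices (the toy model, the synthetic data, the layer index and the threshold tau are illustrative), not the authors' exact measurement protocol or their VGG-16/ResNet-56/LeNet-5 setup.

```python
# A minimal sketch (not the authors' exact protocol) of the diagnostic described in
# the abstract: compare how nearly rank-deficient a hidden layer's activation matrix
# becomes when a small model is trained with a small vs. a large batch size.
# The toy model, synthetic data, layer choice and threshold `tau` are illustrative
# assumptions, not the paper's VGG-16/ResNet-56/LeNet-5 setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic two-class data standing in for the image datasets used in the paper.
X = torch.randn(4096, 64)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()
X_val = torch.randn(512, 64)


def make_model():
    return nn.Sequential(
        nn.Linear(64, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 2),
    )


def train(model, batch_size, epochs=10, lr=0.05):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        perm = torch.randperm(X.size(0))
        for i in range(0, X.size(0), batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            loss_fn(model(X[idx]), y[idx]).backward()
            opt.step()
    return model


def near_rank_deficiency(model, layer_index, tau=1e-2):
    """Fraction of singular values of the (samples x units) activation matrix
    that fall below tau * sigma_max, i.e. directions that are nearly lost."""
    acts = {}

    def hook(_module, _inputs, output):
        acts["a"] = output.detach()

    handle = model[layer_index].register_forward_hook(hook)
    with torch.no_grad():
        model(X_val)
    handle.remove()

    s = torch.linalg.svdvals(acts["a"])
    return (s < tau * s.max()).float().mean().item()


for bs in (32, 1024):
    model = train(make_model(), batch_size=bs)
    frac = near_rank_deficiency(model, layer_index=1)  # activations after the first ReLU
    print(f"batch size {bs:4d}: fraction of near-zero singular values = {frac:.3f}")
```

A higher fraction of near-zero singular values indicates activation tensors that are closer to rank deficiency, the condition the paper links to degraded optimization and generalization at large batch sizes; the exact numbers from this toy setup will vary with the seed and hyperparameters.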
Related papers
- Enhancing Size Generalization in Graph Neural Networks through Disentangled Representation Learning [7.448831299106425]
DISGEN is a model-agnostic framework designed to disentangle size factors from graph representations.
Our empirical results show that DISGEN outperforms the state-of-the-art models by up to 6% on real-world datasets.
arXiv Detail & Related papers (2024-06-07T03:19:24Z)
- Relearning Forgotten Knowledge: on Forgetting, Overfit and Training-Free Ensembles of DNNs [9.010643838773477]
We introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data.
We show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated.
We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides significant improvement without any additional cost in training time.
arXiv Detail & Related papers (2023-10-17T09:22:22Z)
- Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z)
- A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking [124.21408098724551]
Large-scale graph training is a notoriously challenging problem for graph neural networks (GNNs).
We present a new ensembling training manner, named EnGCN, to address the existing issues.
Our proposed method has achieved new state-of-the-art (SOTA) performance on large-scale datasets.
arXiv Detail & Related papers (2022-10-14T03:43:05Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We also show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- Wide Network Learning with Differential Privacy [7.453881927237143]
The current generation of neural networks suffers a significant loss of accuracy under most practically relevant privacy training regimes.
We develop a general approach to training these models that takes advantage of the sparsity of the gradients of private Empirical Risk Minimization (ERM).
For the same number of parameters, we propose a novel algorithm for privately training neural networks.
arXiv Detail & Related papers (2021-03-01T20:31:50Z)
- A Biased Graph Neural Network Sampler with Near-Optimal Regret [57.70126763759996]
Graph neural networks (GNN) have emerged as a vehicle for applying deep network architectures to graph and relational data.
In this paper, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem.
We introduce a newly-designed reward function that introduces some degree of bias designed to reduce variance and avoid unstable, possibly-unbounded payouts.
arXiv Detail & Related papers (2021-03-01T15:55:58Z)
- Holistic Filter Pruning for Efficient Deep Neural Networks [25.328005340524825]
"Holistic Filter Pruning" (HFP) is a novel approach for common DNN training that is easy to implement and enables to specify accurate pruning rates.
In various experiments, we give insights into the training and achieve state-of-the-art performance on CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2020-09-17T09:23:36Z)
- Optimization and Generalization Analysis of Transduction through Gradient Boosting and Application to Multi-scale Graph Neural Networks [60.22494363676747]
It is known that current graph neural networks (GNNs) are difficult to make deep due to the problem known as over-smoothing.
Multi-scale GNNs are a promising approach for mitigating the over-smoothing problem.
We derive the optimization and generalization guarantees of transductive learning algorithms that include multi-scale GNNs.
arXiv Detail & Related papers (2020-06-15T17:06:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.