A New Perspective for Understanding Generalization Gap of Deep Neural
Networks Trained with Large Batch Sizes
- URL: http://arxiv.org/abs/2210.12184v1
- Date: Fri, 21 Oct 2022 18:23:12 GMT
- Title: A New Perspective for Understanding Generalization Gap of Deep Neural
Networks Trained with Large Batch Sizes
- Authors: Oyebade K. Oyedotun and Konstantinos Papadopoulos and Djamila Aouada
- Abstract summary: Deep neural networks (DNNs) are typically optimized using various forms of the mini-batch gradient descent algorithm.
Many works report a progressive loss of model generalization when the training batch size is increased beyond some limit.
This scenario is commonly referred to as the generalization gap.
Our analysis suggests that a large training batch size results in increased near-rank loss of units' activation (i.e. output) tensors.
- Score: 14.822603738271138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks (DNNs) are typically optimized using various
forms of the mini-batch gradient descent algorithm. A major motivation for
mini-batch gradient descent is that, with a suitably chosen batch size,
available computing resources (including parallelization) can be utilized
optimally for fast model training. However, many works report a progressive
loss of model generalization when the training batch size is increased beyond
some limit, a scenario commonly referred to as the generalization gap. Although
several works have proposed different methods for alleviating the
generalization gap problem, a unanimous account of the generalization gap is
still lacking in the literature. This is especially important given that recent
works have observed that several proposed remedies, such as learning rate
scaling and an increased training budget, do not in fact resolve it. The main
aim of this paper is therefore to investigate and provide new perspectives on
the source of generalization loss for DNNs trained with a large batch size. Our
analysis suggests that a large training batch size results in increased
near-rank loss of units' activation (i.e. output) tensors, which in turn
impairs model optimization and generalization. Extensive validation experiments
are performed on popular DNN models, namely VGG-16, a residual network
(ResNet-56) and LeNet-5, using the CIFAR-10, CIFAR-100, Fashion-MNIST and MNIST
datasets.
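As an illustration of the kind of diagnostic the abstract describes, the sketch below trains a small multilayer perceptron on synthetic data with a small and a large batch size, then measures how nearly rank-deficient a hidden layer's activation matrix is via the fraction of its singular values that fall below a threshold relative to the largest one. This is a minimal sketch under assumed choices (the toy model, the synthetic data, the layer index and the threshold tau are illustrative), not the authors' exact measurement protocol or their VGG-16/ResNet-56/LeNet-5 setup.

```python
# A minimal sketch (not the authors' exact protocol) of the diagnostic described in
# the abstract: compare how nearly rank-deficient a hidden layer's activation matrix
# becomes when a small model is trained with a small vs. a large batch size.
# The toy model, synthetic data, layer choice and threshold `tau` are illustrative
# assumptions, not the paper's VGG-16/ResNet-56/LeNet-5 setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic two-class data standing in for the image datasets used in the paper.
X = torch.randn(4096, 64)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()
X_val = torch.randn(512, 64)


def make_model():
    return nn.Sequential(
        nn.Linear(64, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 2),
    )


def train(model, batch_size, epochs=10, lr=0.05):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        perm = torch.randperm(X.size(0))
        for i in range(0, X.size(0), batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            loss_fn(model(X[idx]), y[idx]).backward()
            opt.step()
    return model


def near_rank_deficiency(model, layer_index, tau=1e-2):
    """Fraction of singular values of the (samples x units) activation matrix
    that fall below tau * sigma_max, i.e. directions that are nearly lost."""
    acts = {}

    def hook(_module, _inputs, output):
        acts["a"] = output.detach()

    handle = model[layer_index].register_forward_hook(hook)
    with torch.no_grad():
        model(X_val)
    handle.remove()

    s = torch.linalg.svdvals(acts["a"])
    return (s < tau * s.max()).float().mean().item()


for bs in (32, 1024):
    model = train(make_model(), batch_size=bs)
    frac = near_rank_deficiency(model, layer_index=1)  # activations after the first ReLU
    print(f"batch size {bs:4d}: fraction of near-zero singular values = {frac:.3f}")
```

A higher fraction of near-zero singular values indicates activation tensors that are closer to rank deficiency, the condition the paper links to degraded optimization and generalization at large batch sizes; the exact numbers from this toy setup will vary with the seed and hyperparameters.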
Related papers
- Enhancing Size Generalization in Graph Neural Networks through Disentangled Representation Learning [7.448831299106425]
DISGEN is a model-agnostic framework designed to disentangle size factors from graph representations.
Our empirical results show that DISGEN outperforms the state-of-the-art models by up to 6% on real-world datasets.
arXiv Detail & Related papers (2024-06-07T03:19:24Z)
- Relearning Forgotten Knowledge: on Forgetting, Overfit and Training-Free Ensembles of DNNs [9.010643838773477]
We introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data.
We show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated.
We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides significant improvement without any additional cost in training time.
arXiv Detail & Related papers (2023-10-17T09:22:22Z)
- Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z)
- A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking [124.21408098724551]
Large-scale graph training is a notoriously challenging problem for graph neural networks (GNNs).
We present a new ensembling training manner, named EnGCN, to address the existing issues.
Our proposed method has achieved new state-of-the-art (SOTA) performance on large-scale datasets.
arXiv Detail & Related papers (2022-10-14T03:43:05Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We also show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- Wide Network Learning with Differential Privacy [7.453881927237143]
The current generation of neural networks suffers a significant loss of accuracy under most practically relevant privacy training regimes.
We develop a general approach to training these models that takes advantage of the sparsity of the gradients of private Empirical Risk Minimization (ERM).
For the same number of parameters, we propose a novel algorithm for privately training neural networks.
arXiv Detail & Related papers (2021-03-01T20:31:50Z)
- A Biased Graph Neural Network Sampler with Near-Optimal Regret [57.70126763759996]
Graph neural networks (GNN) have emerged as a vehicle for applying deep network architectures to graph and relational data.
In this paper, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem.
We introduce a newly-designed reward function that introduces some degree of bias designed to reduce variance and avoid unstable, possibly-unbounded payouts.
arXiv Detail & Related papers (2021-03-01T15:55:58Z)
- Holistic Filter Pruning for Efficient Deep Neural Networks [25.328005340524825]
"Holistic Filter Pruning" (HFP) is a novel approach for common DNN training that is easy to implement and enables to specify accurate pruning rates.
In various experiments, we give insights into the training and achieve state-of-the-art performance on CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2020-09-17T09:23:36Z)
- Optimization and Generalization Analysis of Transduction through Gradient Boosting and Application to Multi-scale Graph Neural Networks [60.22494363676747]
It is known that current graph neural networks (GNNs) are difficult to make deep due to the problem known as over-smoothing.
Multi-scale GNNs are a promising approach for mitigating the over-smoothing problem.
We derive the optimization and generalization guarantees of transductive learning algorithms that include multi-scale GNNs.
arXiv Detail & Related papers (2020-06-15T17:06:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.