Related papers: Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers

Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers

URL: http://arxiv.org/abs/2410.19139v1
Date: Thu, 24 Oct 2024 20:15:45 GMT
Title: Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers
Authors: Shuning Shang, Xuran Meng, Yuan Cao, Difan Zou,
Abstract summary: We extend the analysis to two-layer ReLU convolutional neural networks (CNNs) with fully trainable layers. Our results show that the scaling of the output layer is crucial to the training dynamics. In both settings, we provide nearly matching upper and lower bounds on the test errors.
Score: 20.25049261035324
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Benign overfitting refers to how over-parameterized neural networks can fit training data perfectly and generalize well to unseen data. While this has been widely investigated theoretically, existing works are limited to two-layer networks with fixed output layers, where only the hidden weights are trained. We extend the analysis to two-layer ReLU convolutional neural networks (CNNs) with fully trainable layers, which is closer to the practice. Our results show that the initialization scaling of the output layer is crucial to the training dynamics: large scales make the model training behave similarly to that with the fixed output, the hidden layer grows rapidly while the output layer remains largely unchanged; in contrast, small scales result in more complex layer interactions, the hidden layer initially grows to a specific ratio relative to the output layer, after which both layers jointly grow and maintain that ratio throughout training. Furthermore, in both settings, we provide nearly matching upper and lower bounds on the test errors, identifying the sharp conditions on the initialization scaling and signal-to-noise ratio (SNR) in which the benign overfitting can be achieved or not. Numerical experiments back up the theoretical results.

Related papers

Towards the Training of Deeper Predictive Coding Neural Networks [53.15874572081944]
Predictive coding networks trained with equilibrium propagation are neural models that perform inference through an iterative energy process.<n>Previous studies have demonstrated their effectiveness in shallow architectures, but show significant performance degradation when depth exceeds five to seven layers.<n>We show that the reason behind this degradation is due to exponentially imbalanced errors between layers during weight updates, and predictions from the previous layer not being effective in guiding updates in deeper layers.
arXiv Detail & Related papers (2025-06-30T12:44:47Z)
Layerwise complexity-matched learning yields an improved model of cortical area V2 [12.861402235256207]
Deep neural networks trained end-to-end for object recognition approach human capabilities. We develop a self-supervised training methodology that operates independently on successive layers. We show that our model is better aligned with selectivity properties and neural activity in primate area V2.
arXiv Detail & Related papers (2023-12-18T18:37:02Z)
WLD-Reg: A Data-dependent Within-layer Diversity Regularizer [98.78384185493624]
Neural networks are composed of multiple layers arranged in a hierarchical structure jointly trained with a gradient-based optimization. We propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage the diversity of the activations within the same layer. We present an extensive empirical study confirming that the proposed approach enhances the performance of several state-of-the-art neural network models in multiple tasks.
arXiv Detail & Related papers (2023-01-03T20:57:22Z)
Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum. Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, RelU activations, Gaussian data instances, adversarial labels. They strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime''
arXiv Detail & Related papers (2022-12-05T14:47:52Z)
With Greater Distance Comes Worse Performance: On the Perspective of Layer Utilization and Model Generalization [3.6321778403619285]
Generalization of deep neural networks remains one of the main open problems in machine learning. Early layers generally learn representations relevant to performance on both training data and testing data. Deeper layers only minimize training risks and fail to generalize well with testing or mislabeled data.
arXiv Detail & Related papers (2022-01-28T05:26:32Z)
Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations. This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z)
Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss. We examine how these benign overfitting phenomena occur in a two-layer neural network setting. We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks. We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a emphcovariance operator. To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a emphhierarchical latent tree model (HLTM)
arXiv Detail & Related papers (2020-10-01T17:51:49Z)
Tensor decomposition to Compress Convolutional Layers in Deep Learning [5.199454801210509]
We propose to use CP-decomposition to approximately compress the convolutional layer (CPAC-Conv layer) in deep learning. The contributions of our work could be summarized into three aspects: (1) we adapt CP-decomposition to compress convolutional kernels and derive the expressions of both forward and backward propagations for our proposed CPAC-Conv layer; (2) compared with the original convolutional layer, the proposed CPAC-Conv layer can reduce the number of parameters without decaying prediction performance; and (3) the value of decomposed kernels indicates the significance of the corresponding feature map.
arXiv Detail & Related papers (2020-05-28T02:35:48Z)
Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix. Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss [0.0]
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. We analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations.
arXiv Detail & Related papers (2020-02-11T15:42:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.