The effect of Target Normalization and Momentum on Dying ReLU
- URL: http://arxiv.org/abs/2005.06195v1
- Date: Wed, 13 May 2020 08:01:13 GMT
- Title: The effect of Target Normalization and Momentum on Dying ReLU
- Authors: Isac Arnekvist, J. Frederico Carvalho, Danica Kragic and Johannes A.
Stork
- Abstract summary: We show that unit variance targets are well motivated and that ReLUs die more easily when target variance approaches zero.
We also analyze the gradients of a single-ReLU model to identify saddle points and regions corresponding to dying ReLU.
- Score: 22.41606885255209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimizing parameters with momentum, normalizing data values, and using
rectified linear units (ReLUs) are popular choices in neural network (NN)
regression. Although ReLUs are popular, they can collapse to a constant
function and "die", effectively removing their contribution from the model.
While some mitigations are known, the underlying reasons why ReLUs die during
optimization are currently poorly understood. In this paper, we consider the
effects of target normalization and momentum on dying ReLUs. We find
empirically that unit variance targets are well motivated and that ReLUs die
more easily when target variance approaches zero. To further investigate this
matter, we analyze a discrete-time linear autonomous system, and show
theoretically how this relates to a model with a single ReLU and how common
properties can result in dying ReLU. We also analyze the gradients of a
single-ReLU model to identify saddle points and regions corresponding to dying
ReLU and how parameters evolve into these regions when momentum is used.
Finally, we show empirically that this problem persists, and is aggravated, for
deeper models including residual networks.
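The abstract's central claim lends itself to a quick numerical probe. Below is a minimal sketch (not the paper's actual experimental setup; the toy data, learning rate, and momentum value are assumptions) that trains a single-ReLU regressor with heavy-ball momentum on targets rescaled to different magnitudes and reports how often the unit ends up dead, i.e. outputs zero on every training input.

```python
import numpy as np

def run_trial(target_scale, momentum=0.9, lr=0.05, steps=2000, n=256, rng=None):
    """Train y_hat = v * relu(w * x + b) on 1-D data; report whether the ReLU ended up dead."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.normal(size=n)
    y = target_scale * np.sin(2.0 * x)          # toy targets, rescaled to the desired magnitude
    w, b, v = rng.normal(size=3) * 0.5          # small random initialization
    vel = np.zeros(3)                           # momentum buffer for (w, b, v)
    for _ in range(steps):
        pre = w * x + b
        h = np.maximum(pre, 0.0)                # ReLU activation
        err = v * h - y
        gate = (pre > 0).astype(float)
        grad = np.array([np.mean(2 * err * v * gate * x),   # d MSE / d w
                         np.mean(2 * err * v * gate),       # d MSE / d b
                         np.mean(2 * err * h)])             # d MSE / d v
        vel = momentum * vel - lr * grad        # heavy-ball momentum update
        w, b, v = np.array([w, b, v]) + vel
    return bool(np.all(w * x + b <= 0))         # dead: zero output (and zero gradient) on all inputs

rng = np.random.default_rng(0)
for scale in (1.0, 0.1, 0.01):
    dead = [run_trial(scale, rng=rng) for _ in range(50)]
    print(f"target scale {scale:5.2f}: {np.mean(dead):.0%} of runs ended with a dead ReLU")
```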
Related papers
- The Resurrection of the ReLU [1.0626574691596062]
We introduce surrogate gradient learning for ReLU (SUGAR) as a novel, plug-and-play regularizer for deep architectures.
SUGAR preserves the standard ReLU function during the forward pass but replaces its derivative in the backward pass with a smooth surrogate.
We demonstrate that SUGAR, when paired with a well-chosen surrogate function, substantially enhances performance over convolutional network architectures.
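The forward/backward split described above can be sketched as a custom autograd function. This is an illustrative sketch only; the sigmoid used as the smooth surrogate derivative is an assumption, not necessarily the surrogate proposed in the paper.

```python
import torch

class SurrogateReLU(torch.autograd.Function):
    """Standard ReLU in the forward pass; a smooth surrogate derivative in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.relu(x)                    # exact ReLU forward

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        surrogate_grad = torch.sigmoid(x)       # assumed smooth stand-in for the step 1[x > 0]
        return grad_output * surrogate_grad     # non-zero gradient even where x < 0

x = torch.randn(8, requires_grad=True)
SurrogateReLU.apply(x).sum().backward()
print(x.grad)   # negative inputs still receive (small) gradient, unlike plain ReLU
```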
arXiv Detail & Related papers (2025-05-28T07:55:51Z)
- Smart Predict-then-Optimize Method with Dependent Data: Risk Bounds and Calibration of Autoregression [7.369846475695131]
We present an autoregressive SPO method directly targeting the optimization problem at the decision stage.
We conduct experiments to demonstrate the effectiveness of the SPO+ surrogate compared to the absolute loss and the least squares loss.
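For reference, the SPO+ surrogate mentioned here has a closed form for a linear objective min_{w in S} c^T w. The sketch below assumes a small finite feasible set so the inner maximization and the true optimum can be found by enumeration; this toy oracle and the example costs are assumptions, not the paper's autoregressive setting.

```python
import numpy as np

def solve(c, S):
    """Toy optimization oracle: pick the decision in the finite set S minimizing c @ w."""
    return S[int(np.argmin(S @ c))]

def spo_plus_loss(c_hat, c, S):
    """SPO+ surrogate loss (Elmachtoub & Grigas) for predicted cost c_hat and true cost c."""
    w_star = solve(c, S)                              # optimal decision under the true cost
    z_star = float(c @ w_star)                        # optimal true objective value
    inner = float(np.max(S @ (c - 2.0 * c_hat)))      # max_{w in S} (c - 2 c_hat)^T w
    return inner + 2.0 * float(c_hat @ w_star) - z_star

# Example: three candidate decisions over two cost components.
S = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
c_true = np.array([2.0, 1.0])
print(spo_plus_loss(np.array([2.1, 0.9]), c_true, S))  # good prediction -> zero loss
print(spo_plus_loss(np.array([0.5, 3.0]), c_true, S))  # bad prediction  -> larger loss
```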
arXiv Detail & Related papers (2024-11-19T17:02:04Z)
- Rethinking Model Re-Basin and Linear Mode Connectivity [1.1510009152620668]
We decompose re-normalization into rescaling and reshift, uncovering that rescaling plays a crucial role in re-normalization.
We identify that the merged model suffers from the issue of activation collapse and magnitude collapse.
We propose a new perspective to unify the re-basin and pruning, under which a lightweight yet effective post-pruning technique is derived.
arXiv Detail & Related papers (2024-02-05T17:06:26Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron [121.10338065441417]
We analyze a Perceptron-type algorithm called GLM-tron and provide its dimension-free risk upper bounds for high-dimensional ReLU regression.
Our results suggest that GLM-tron might be preferable to SGD for high-dimensional ReLU regression.
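GLM-tron itself is a simple Perceptron-style iteration (Kakade et al., 2011). The sketch below applies its update to synthetic ReLU regression data; the data model and step count are assumptions, and this is not the paper's analysis.

```python
import numpy as np

def glm_tron_relu(X, y, steps=200):
    """GLM-tron for ReLU regression: w <- w + mean_i (y_i - relu(w . x_i)) x_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        preds = np.maximum(X @ w, 0.0)
        w = w + (X.T @ (y - preds)) / n     # Perceptron-like update, unit step size
    return w

rng = np.random.default_rng(0)
d, n = 20, 2000
w_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = np.maximum(X @ w_true, 0.0) + 0.05 * rng.normal(size=n)   # noisy ReLU targets
w_hat = glm_tron_relu(X, y)
print("parameter error:", np.linalg.norm(w_hat - w_true))
```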
arXiv Detail & Related papers (2023-03-03T23:02:23Z)
- Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural Isometry and Exact Recovery [33.74925020397343]
The practice of deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters.
We consider the training and generalization properties of two-layer ReLU networks with standard weight decay regularization.
We show that ReLU networks learn simple and sparse models even when the labels are noisy.
arXiv Detail & Related papers (2022-09-30T06:47:15Z)
- Support Vectors and Gradient Dynamics for Implicit Bias in ReLU Networks [45.886537625951256]
We study gradient flow dynamics in the parameter space when training single-neuron ReLU networks.
Specifically, we discover implicit bias in terms of support vectors in ReLU networks, which play a key role in why and how ReLU networks generalize well.
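Gradient flow can be approximated by gradient descent with a very small step size. The toy sketch below (the dataset, initialization, and step size are all assumptions) tracks how a bias-free single-ReLU neuron's weight vector and its direction evolve under such a flow.

```python
import numpy as np

# Toy 2-D dataset for a bias-free single ReLU neuron f(x) = relu(w . x).
X = np.array([[1.0, 0.5], [0.8, -0.2], [-0.5, 1.0], [-1.0, -0.8]])
y = np.array([1.0, 0.5, 0.2, 0.0])

def loss_grad(w):
    pre = X @ w
    err = np.maximum(pre, 0.0) - y
    gate = (pre > 0).astype(float)
    return X.T @ (err * gate) / len(y)       # gradient of (1/2n) * sum (relu(w.x) - y)^2

w = np.array([0.1, -0.3])                    # small initialization
dt = 1e-3                                    # Euler step approximating continuous-time flow
for t in range(20001):
    w = w - dt * loss_grad(w)
    if t % 5000 == 0:
        print(t, np.round(w, 3), np.round(w / np.linalg.norm(w), 3))  # weight and its direction
```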
arXiv Detail & Related papers (2022-02-11T08:55:58Z)
- ReLU Regression with Massart Noise [52.10842036932169]
We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data.
We focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model.
We develop an efficient algorithm that achieves exact parameter recovery in this model.
arXiv Detail & Related papers (2021-09-10T02:13:22Z)
- Regression Bugs Are In Your Model! Measuring, Reducing and Analyzing Regressions In NLP Model Updates [68.09049111171862]
This work focuses on quantifying, reducing and analyzing regression errors in the NLP model updates.
We formulate the regression-free model updates into a constrained optimization problem.
We empirically analyze how model ensemble reduces regression.
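One common way to quantify regressions of this kind is the negative-flip rate: the share of examples the old model got right but the updated model gets wrong. This is an illustrative metric and example, not necessarily the exact measure used in the paper.

```python
import numpy as np

def negative_flip_rate(old_preds, new_preds, labels):
    """Share of examples that regress: correct under the old model, wrong under the new one."""
    old_preds, new_preds, labels = map(np.asarray, (old_preds, new_preds, labels))
    return float(np.mean((old_preds == labels) & (new_preds != labels)))

labels    = [0, 1, 1, 0, 2, 2]
old_preds = [0, 1, 0, 0, 2, 1]   # old model: 4/6 correct
new_preds = [0, 0, 1, 0, 2, 2]   # new model: 5/6 correct overall, but flips one example
print(negative_flip_rate(old_preds, new_preds, labels))  # 1/6 despite the higher accuracy
```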
arXiv Detail & Related papers (2021-05-07T03:33:00Z)
- ALReLU: A different approach on Leaky ReLU activation function to improve Neural Networks Performance [0.0]
The classical ReLU activation function (AF) has been extensively applied in Deep Neural Networks (DNN).
The common gradient issues of ReLU pose challenges for applications in both academic and industry settings.
The Absolute Leaky ReLU (ALReLU) AF, a variation of LReLU, is proposed as an alternative method to resolve the common 'dying ReLU problem'.
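Based on the summary, ALReLU keeps the identity for non-negative inputs and, unlike Leaky ReLU, returns the absolute value of the scaled negative part, so its output is never negative. A minimal sketch (the slope value alpha is an assumption):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def alrelu(x, alpha=0.01):
    """Absolute Leaky ReLU: x for x >= 0, |alpha * x| for x < 0 (always non-negative)."""
    return np.where(x >= 0, x, np.abs(alpha * x))

x = np.linspace(-3, 3, 7)
print("x:     ", x)
print("LReLU: ", leaky_relu(x))
print("ALReLU:", alrelu(x))   # negative inputs map to small positive values instead of small negatives
```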
arXiv Detail & Related papers (2020-12-11T06:46:42Z)
- Approximation Schemes for ReLU Regression [80.33702497406632]
We consider the fundamental problem of ReLU regression.
The goal is to output the best-fitting ReLU with respect to square loss, given draws from some unknown distribution.
arXiv Detail & Related papers (2020-05-26T16:26:17Z)
- Dynamic ReLU [74.973224160508]
We propose dynamic ReLU (DY-ReLU), a dynamic rectifier whose parameters are generated by a hyper function over all input elements.
Compared to its static counterpart, DY-ReLU has negligible extra computational cost, but significantly more representation capability.
By simply using DY-ReLU for MobileNetV2, the top-1 accuracy on ImageNet classification is boosted from 72.0% to 76.2% with only 5% additional FLOPs.
arXiv Detail & Related papers (2020-03-22T23:45:35Z)
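The hyper-function idea can be sketched as a small module that pools the input, predicts K slope/intercept pairs per channel, and takes the element-wise maximum of the resulting linear pieces. The layer sizes, K = 2, and coefficient ranges below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DyReLUSketch(nn.Module):
    """Sketch of a dynamic ReLU: y = max_k (a_k(x) * x + b_k(x)), coefficients from a hyper function."""

    def __init__(self, channels, k=2, reduction=4):
        super().__init__()
        self.k = k
        self.hyper = nn.Sequential(                 # hyper function over pooled input statistics
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * k * channels),
            nn.Tanh(),                              # keep coefficient offsets bounded
        )

    def forward(self, x):                           # x: (N, C, H, W)
        n, c, _, _ = x.shape
        theta = self.hyper(x).view(n, 2, self.k, c)
        a = 1.0 + 0.5 * theta[:, 0]                 # (N, K, C) slopes centered at 1 (assumed range)
        b = 0.5 * theta[:, 1]                       # (N, K, C) intercepts centered at 0
        cands = (a.unsqueeze(-1).unsqueeze(-1) * x.unsqueeze(1)
                 + b.unsqueeze(-1).unsqueeze(-1))   # (N, K, C, H, W) linear pieces
        return cands.max(dim=1).values              # element-wise max over the K pieces

layer = DyReLUSketch(channels=16)
print(layer(torch.randn(2, 16, 8, 8)).shape)        # torch.Size([2, 16, 8, 8])
```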
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.