NoMorelization: Building Normalizer-Free Models from a Sample's Perspective
- URL: http://arxiv.org/abs/2210.06932v1
- Date: Thu, 13 Oct 2022 12:04:24 GMT
- Title: NoMorelization: Building Normalizer-Free Models from a Sample's Perspective
- Authors: Chang Liu, Yuwen Yang, Yue Ding, Hongtao Lu
- Abstract summary: We propose a simple and effective alternative to normalization called "NoMorelization".
NoMorelization is composed of two trainable scalars and a zero-centered noise injector.
Compared with existing mainstream normalizers, NoMorelization shows the best speed-accuracy trade-off.
- Score: 17.027460848621434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The normalizing layer has become one of the basic configurations of deep
learning models, but it still suffers from computational inefficiency,
interpretability difficulties, and low generality. After gaining a deeper
understanding of recent normalization and normalizer-free research from a
sample's perspective, we reveal that the problem lies in the sampling noise
and an inappropriate prior assumption. In this paper, we
propose a simple and effective alternative to normalization, which is called
"NoMorelization". NoMorelization is composed of two trainable scalars and a
zero-centered noise injector. Experimental results demonstrate that
NoMorelization is a general component for deep learning and is suitable for
different model paradigms (e.g., convolution-based and attention-based models)
to tackle different tasks (e.g., discriminative and generative tasks). Compared
with existing mainstream normalizers (e.g., BN, LN, and IN) and
state-of-the-art normalizer-free methods, NoMorelization shows the best
speed-accuracy trade-off.
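Reading only from the abstract, a NoMorelization-style layer might look like the following PyTorch sketch. The roles of the two scalars (one scale, one shift), the Gaussian form of the noise, and the hyperparameter `sigma` are assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class NoMorelization(nn.Module):
    """Sketch of a normalizer-free layer per the abstract: two trainable
    scalars plus a zero-centered noise injector that is active only
    during training. `sigma` (noise scale) is an assumed hyperparameter."""

    def __init__(self, sigma: float = 0.1):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))   # trainable scale
        self.beta = nn.Parameter(torch.zeros(1))   # trainable shift
        self.sigma = sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.alpha * x + self.beta
        if self.training:
            # zero-centered noise, mimicking the regularizing sampling
            # noise that batch statistics would otherwise introduce
            y = y + self.sigma * torch.randn_like(y)
        return y
```

At inference the layer reduces to a deterministic affine map, so it adds essentially no runtime cost, which is consistent with the claimed speed-accuracy trade-off.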
Related papers
- GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection [60.78684630040313]
Diffusion models tend to reconstruct normal counterparts of test images with certain noise added.
From the global perspective, the difficulty of reconstructing images with different anomalies is uneven.
We propose a global and local adaptive diffusion model (abbreviated to GLAD) for unsupervised anomaly detection.
arXiv Detail & Related papers (2024-06-11T17:27:23Z)
- Normality Learning-based Graph Anomaly Detection via Multi-Scale Contrastive Learning [61.57383634677747]
Graph anomaly detection (GAD) has attracted increasing attention in machine learning and data mining.
Here, we propose a normality learning-based GAD framework via multi-scale contrastive learning networks (NLGAD for short).
Notably, the proposed algorithm improves the detection performance (up to 5.89% AUC gain) compared with the state-of-the-art methods.
arXiv Detail & Related papers (2023-09-12T08:06:04Z)
- HyperInvariances: Amortizing Invariance Learning [10.189246340672245]
Invariance learning is expensive and data intensive for popular neural architectures.
We introduce the notion of amortizing invariance learning.
This framework can identify appropriate invariances in different downstream tasks and lead to comparable or better test performance.
arXiv Detail & Related papers (2022-07-17T21:40:37Z)
- Explicit Regularization in Overparametrized Models via Noise Injection [14.492434617004932]
We show that small perturbations induce explicit regularization for simple finite-dimensional models.
We empirically show that the small perturbations lead to better generalization performance than vanilla (stochastic) gradient descent training.
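One common form of noise injection consistent with this summary perturbs the parameters before each gradient evaluation; a minimal sketch follows (the helper name and `sigma` are hypothetical, not the paper's exact procedure):

```python
import torch

def perturbed_grad_step(params, loss_fn, lr=1e-2, sigma=1e-2):
    """Evaluate the gradient at noise-perturbed parameters, then take
    an SGD step from the original point (an illustrative sketch)."""
    noise = [sigma * torch.randn_like(p) for p in params]
    for p, n in zip(params, noise):        # move to the perturbed point
        p.data.add_(n)
    loss = loss_fn()                       # forward pass at perturbed params
    grads = torch.autograd.grad(loss, params)
    for p, n, g in zip(params, noise, grads):
        p.data.sub_(n)                     # move back to the original point
        p.data.sub_(lr * g)                # descend along the perturbed gradient
    return loss.item()
```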
arXiv Detail & Related papers (2022-06-09T17:00:23Z)
- Information-Theoretic Generalization Bounds for Iterative Semi-Supervised Learning [81.1071978288003]
In particular, we seek to understand the behaviour of the generalization error of iterative SSL algorithms using information-theoretic principles.
Our theoretical results suggest that when the class conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates.
arXiv Detail & Related papers (2021-10-03T05:38:49Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We also show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Explainable Deep Few-shot Anomaly Detection with Deviation Networks [123.46611927225963]
We introduce a novel weakly-supervised anomaly detection framework to train detection models.
The proposed approach learns discriminative normality by leveraging the labeled anomalies and a prior probability.
Our model is substantially more sample-efficient and robust, and performs significantly better than state-of-the-art competing methods in both closed-set and open-set settings.
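The deviation-network idea can be sketched as a loss that pulls the anomaly scores of normal samples toward a Gaussian reference prior and pushes labeled anomalies at least a margin above it. The N(0, 1) prior and margin follow the original deviation-loss formulation; the exact settings below are illustrative:

```python
import torch

def deviation_loss(scores, labels, margin=5.0, n_ref=5000):
    """Deviation-style loss: `labels` is 1 for labeled anomalies and
    0 for (assumed) normal samples. Normal scores are pulled toward a
    N(0, 1) prior; anomalies are pushed `margin` deviations above it."""
    ref = torch.randn(n_ref, device=scores.device)       # prior reference scores
    dev = (scores - ref.mean()) / (ref.std() + 1e-8)     # z-score deviation
    normal_term = (1 - labels) * dev.abs()
    anomaly_term = labels * torch.clamp(margin - dev, min=0.0)
    return (normal_term + anomaly_term).mean()
```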
arXiv Detail & Related papers (2021-08-01T14:33:17Z)
- Squared $\ell_2$ Norm as Consistency Loss for Leveraging Augmented Data to Learn Robust and Invariant Representations [76.85274970052762]
Regularizing distance between embeddings/representations of original samples and augmented counterparts is a popular technique for improving robustness of neural networks.
In this paper, we explore these various regularization choices, seeking to provide a general understanding of how we should regularize the embeddings.
We show that the generic approach we identified (squared $\ell_2$-norm regularized augmentation) outperforms several recent methods, each specially designed for one task.
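A minimal sketch of such a squared-$\ell_2$ consistency regularizer, assuming a model that returns both logits and an embedding (the interface and the weight `lam` are hypothetical):

```python
import torch

def l2_consistency_objective(model, x, x_aug, y, task_loss, lam=1.0):
    """Task loss plus the squared l2 distance between embeddings of
    original and augmented samples."""
    logits, z = model(x)          # assumed (logits, embedding) interface
    _, z_aug = model(x_aug)
    consistency = (z - z_aug).pow(2).sum(dim=1).mean()   # squared l2 norm
    return task_loss(logits, y) + lam * consistency
```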
arXiv Detail & Related papers (2020-11-25T22:40:09Z)
- Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition [34.235007566913396]
We describe an interpretable, symmetric decomposition of the variance into terms associated with the labels.
We find that the bias decreases monotonically with the network width, but the variance terms exhibit non-monotonic behavior.
We also analyze the strikingly rich phenomenology that arises.
arXiv Detail & Related papers (2020-11-04T21:04:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.