Impact of Bottleneck Layers and Skip Connections on the Generalization of Linear Denoising Autoencoders
- URL: http://arxiv.org/abs/2505.24668v1
- Date: Fri, 30 May 2025 14:58:02 GMT
- Title: Impact of Bottleneck Layers and Skip Connections on the Generalization of Linear Denoising Autoencoders
- Authors: Jonghyun Ham, Maximilian Fleissner, Debarghya Ghoshdastidar
- Abstract summary: We focus on two-layer linear denoising autoencoders trained under gradient flow. A low-dimensional bottleneck layer effectively enforces a rank constraint on the learned solution, and a skip connection can mitigate the variance in denoising autoencoders.
- Score: 6.178817969919849
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern deep neural networks exhibit strong generalization even in highly overparameterized regimes. Significant progress has been made to understand this phenomenon in the context of supervised learning, but for unsupervised tasks such as denoising, several open questions remain. While some recent works have successfully characterized the test error of the linear denoising problem, they are limited to linear models (one-layer networks). In this work, we focus on two-layer linear denoising autoencoders trained under gradient flow, incorporating two key ingredients of modern deep learning architectures: A low-dimensional bottleneck layer that effectively enforces a rank constraint on the learned solution, as well as the possibility of a skip connection that bypasses the bottleneck. We derive closed-form expressions for all critical points of this model under product regularization, and in particular describe its global minimizer under the minimum-norm principle. From there, we derive the test risk formula in the overparameterized regime, both for models with and without skip connections. Our analysis reveals two interesting phenomena: Firstly, the bottleneck layer introduces an additional complexity measure akin to the classical bias-variance trade-off -- increasing the bottleneck width reduces bias but introduces variance, and vice versa. Secondly, a skip connection can mitigate the variance in denoising autoencoders -- especially when the model is mildly overparameterized. We further analyze the impact of skip connections in denoising autoencoders using random matrix theory and support our claims with numerical evidence.
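To make the setup concrete, here is a minimal NumPy sketch (not the authors' code) of a two-layer linear denoising autoencoder with a rank-k bottleneck, an optional identity skip connection, and gradient descent on the denoising loss plus a penalty on the product W2 W1 (our reading of "product regularization"). All dimensions, the noise level, and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 5, 200                 # data dimension, bottleneck width, training samples (illustrative)
sigma, lam, lr = 0.5, 1e-3, 1e-2     # noise level, product-regularization strength, step size

X = rng.normal(size=(d, n))                     # clean training data
Xn = X + sigma * rng.normal(size=(d, n))        # noisy inputs fed to the autoencoder

W1 = 0.01 * rng.normal(size=(k, d))             # encoder: d -> k (the bottleneck enforces rank <= k)
W2 = 0.01 * rng.normal(size=(d, k))             # decoder: k -> d
use_skip = True                                 # identity skip connection bypassing the bottleneck

for _ in range(5000):
    H = W1 @ Xn                                     # bottleneck representation
    Xhat = W2 @ H + (Xn if use_skip else 0.0)       # reconstruction of the clean signal
    R = Xhat - X                                    # residual against clean targets
    # Gradients of (1/2n)*||R||_F^2 + (lam/2)*||W2 W1||_F^2 (product regularization).
    gW2 = R @ H.T / n + lam * (W2 @ W1) @ W1.T
    gW1 = W2.T @ R @ Xn.T / n + lam * W2.T @ (W2 @ W1)
    W2 -= lr * gW2
    W1 -= lr * gW1

Xhat = W2 @ (W1 @ Xn) + (Xn if use_skip else 0.0)
print("train denoising MSE:", np.mean((Xhat - X) ** 2))
```

Setting use_skip = False recovers the pure bottleneck model; per the abstract, the skip connection is what mitigates variance when the model is mildly overparameterized, while the bottleneck width k trades bias against variance.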
Related papers
- Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion [16.99620863197586]
Diffusion language models offer parallel token generation and inherent bidirectionality. State-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. We introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking. For the first time, diffusion language models achieve latency comparable to, and even lower than, that of widely adopted autoregressive models.
arXiv Detail & Related papers (2025-05-27T17:39:39Z) - Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - Understanding the Double Descent Phenomenon in Deep Learning [49.1574468325115]
This tutorial sets up the classical statistical learning framework and introduces the double descent phenomenon.
By looking at a number of examples, Section 2 introduces inductive biases that appear to play a key role in double descent by selecting, among the many interpolating solutions, a smooth one.
Section 3 explores double descent with two linear models and gives other points of view from recent related works (a minimal numerical illustration is sketched after this list).
arXiv Detail & Related papers (2024-03-15T16:51:24Z) - High-dimensional Asymptotics of Denoising Autoencoders [0.0]
We address the problem of denoising data from a Gaussian mixture using a two-layer non-linear autoencoder with tied weights and a skip connection.
We provide closed-form expressions for the denoising mean-squared test error.
arXiv Detail & Related papers (2023-05-18T15:35:11Z) - Fundamental Limits of Two-layer Autoencoders, and Achieving Them with
Gradient Methods [91.54785981649228]
This paper focuses on non-linear two-layer autoencoders trained in the challenging proportional regime.
Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods.
For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders.
arXiv Detail & Related papers (2022-12-27T12:37:34Z) - Self-Supervised Training with Autoencoders for Visual Anomaly Detection [61.62861063776813]
We focus on a specific use case in anomaly detection where the distribution of normal samples is supported by a lower-dimensional manifold.
We adapt a self-supervised learning regime that exploits discriminative information during training but focuses on the submanifold of normal examples.
We achieve a new state-of-the-art result on the MVTec AD dataset -- a challenging benchmark for visual anomaly detection in the manufacturing domain.
arXiv Detail & Related papers (2022-06-23T14:16:30Z) - Robust Training under Label Noise by Over-parameterization [41.03008228953627]
We propose a principled approach for robust training of over-parameterized deep networks in classification tasks where a proportion of training labels are corrupted.
The main idea is very simple: label noise is sparse and incoherent with the network learned from clean data, so we model the noise and learn to separate it from the data.
Remarkably, when trained using such a simple method in practice, we demonstrate state-of-the-art test accuracy against label noise on a variety of real datasets.
arXiv Detail & Related papers (2022-02-28T18:50:10Z) - Benign Overfitting without Linearity: Neural Network Classifiers Trained
by Gradient Descent for Noisy Linear Data [44.431266188350655]
We consider the generalization error of two-layer neural networks trained to interpolation by gradient descent.
We show that neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve minimax optimal test error.
In contrast to previous work on benign overfitting that require linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.
arXiv Detail & Related papers (2022-02-11T23:04:00Z) - Bilevel learning of l1-regularizers with closed-form gradients(BLORC) [8.138650738423722]
We present a method for supervised learning of sparsity-promoting regularizers.
The parameters are learned to minimize the mean squared error of reconstruction on a training set of ground truth signal and measurement pairs.
arXiv Detail & Related papers (2021-11-21T17:01:29Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
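As a companion to the double descent tutorial entry above, here is a small self-contained NumPy sketch (an illustration written for this digest, not taken from any listed paper) of the phenomenon for minimum-norm least squares on random ReLU features: the test error typically spikes near the interpolation threshold p ≈ n and can fall again as p grows, mirroring the bias-variance discussion in the main abstract. The feature map, dimensions, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 30, 0.5           # training samples, latent dimension, label noise (illustrative)
beta = rng.normal(size=d) / np.sqrt(d)

def sample(m):
    """Draw m points from a noisy linear teacher."""
    Z = rng.normal(size=(m, d))
    y = Z @ beta + sigma * rng.normal(size=m)
    return Z, y

Z_tr, y_tr = sample(n)
Z_te, y_te = sample(2000)

def features(Z, V):
    """Random ReLU feature map phi(z) = max(z V, 0)."""
    return np.maximum(Z @ V, 0.0)

for p in [10, 50, 90, 100, 110, 200, 500, 1000]:
    V = rng.normal(size=(d, p)) / np.sqrt(d)       # p random feature directions
    Phi_tr, Phi_te = features(Z_tr, V), features(Z_te, V)
    w = np.linalg.pinv(Phi_tr) @ y_tr              # minimum-norm least-squares fit
    test_mse = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"p = {p:5d}   test MSE = {test_mse:.3f}")
```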
This list is automatically generated from the titles and abstracts of the papers on this site.