Generalization Through Growth: Hidden Dynamics Controls Depth Dependence
- URL: http://arxiv.org/abs/2505.15064v1
- Date: Wed, 21 May 2025 03:32:30 GMT
- Title: Generalization Through Growth: Hidden Dynamics Controls Depth Dependence
- Authors: Sho Sonoda, Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda,
- Abstract summary: We present a unified framework for arbitrary bluepseudo-metric spaces in which a depth-(k) network is the composition of continuous hidden maps (f:mathcalXto mathcalX) and an output map (h:mathcalXto mathbbR).
- Score: 15.67299102925013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent theory has reduced the depth dependence of generalization bounds from exponential to polynomial and even depth-independent rates, yet these results remain tied to specific architectures and Euclidean inputs. We present a unified framework for arbitrary \blue{pseudo-metric} spaces in which a depth-\(k\) network is the composition of continuous hidden maps \(f:\mathcal{X}\to \mathcal{X}\) and an output map \(h:\mathcal{X}\to \mathbb{R}\). The resulting bound $O(\sqrt{(\alpha + \log \beta(k))/n})$ isolates the sole depth contribution in \(\beta(k)\), the word-ball growth of the semigroup generated by the hidden layers. By Gromov's theorem polynomial (resp. exponential) growth corresponds to virtually nilpotent (resp. expanding) dynamics, revealing a geometric dichotomy behind existing $O(\sqrt{k})$ (sublinear depth) and $\tilde{O}(1)$ (depth-independent) rates. We further provide covering-number estimates showing that expanding dynamics yield an exponential parameter saving via compositional expressivity. Our results decouple specification from implementation, offering architecture-agnostic and dynamical-systems-aware guarantees applicable to modern deep-learning paradigms such as test-time inference and diffusion models.
Related papers
- Provable FDR Control for Deep Feature Selection: Deep MLPs and Beyond [0.0]
We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error.<n>We show that each coordinate of gradient-based feature vector admits a marginal normal approximation, thereby supporting the validity of FDR control.
arXiv Detail & Related papers (2025-12-04T11:46:06Z) - Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws [21.18373933718468]
We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime.<n>We present a sharp analysis of the dynamics in the feature learning regime, for both the population limit and the finite-sample discretization.
arXiv Detail & Related papers (2025-08-05T17:57:56Z) - The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models [71.5283441529015]
In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace.<n>We introduce the generative leap exponent $kstar$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting.
arXiv Detail & Related papers (2025-06-05T18:34:56Z) - Quantum Circuit Encodings of Polynomial Chaos Expansions [5.63729124086755]
We study countably-parametric holomorphic maps $u:Uto mathbbR$, where the parameter domain is $U=[-1,1]mathbbN$.<n>We establish dimension-independent quantum circuit approximation rates via the best $n$-term truncations of generalized chaos (gPC) expansions of these parametric maps.<n>Our results have implications for quantum-enhanced algorithms for a wide range of maps in applications.
arXiv Detail & Related papers (2025-06-02T15:53:36Z) - A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff [57.25901375384457]
We propose a nonasymptotic generalization theory for multilayer neural networks with arbitrary Lipschitz activations and general Lipschitz loss functions.<n>In particular, it doens't require the boundness of loss function, as commonly assumed in the literature.<n>We show the near minimax optimality of our theory for multilayer ReLU networks for regression problems.
arXiv Detail & Related papers (2025-03-03T23:34:12Z) - How well behaved is finite dimensional Diffusion Maps? [0.0]
We derive a series of properties that remain valid after finite-dimensional and almost isometric Diffusion Maps (DM)<n>We quantify the error between the estimated tangent spaces and the true tangent spaces over the submanifolds after the DM embedding.<n>These results offer a solid theoretical foundation for understanding the performance and reliability of DM in practical applications.
arXiv Detail & Related papers (2024-12-05T09:12:25Z) - Constructive Universal Approximation and Finite Sample Memorization by Narrow Deep ReLU Networks [0.0]
We show that any dataset with $N$ distinct points in $mathbbRd$ and $M$ output classes can be exactly classified.<n>We also prove a universal approximation theorem in $Lp(Omega; mathbbRm)$ for any bounded domain.<n>Our results offer a unified and interpretable framework connecting controllability, expressivity, and training dynamics in deep neural networks.
arXiv Detail & Related papers (2024-09-10T14:31:21Z) - Absolute abstraction: a renormalisation group approach [0.0]
We argue that depth alone is not enough to develop truly abstract representations.<n>We take the fixed point of this transformation -- the Hierarchical Feature Model -- as a candidate for a representation which is absolutely abstract.
arXiv Detail & Related papers (2024-07-01T14:13:11Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over- parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Policy Gradient with Tree Expansion [72.10002936187388]
Policy gradient methods are notorious for having a large variance and high sample complexity.<n>We introduce SoftTreeMax -- a generalization of softmax that employs planning.<n>We show that SoftTreeMax reduces the gradient variance by three orders of magnitude.
arXiv Detail & Related papers (2023-01-30T19:03:14Z) - Exponential Separations in Symmetric Neural Networks [48.80300074254758]
We consider symmetric Networkparencitesantoro 2017simple architecture as a natural generalization of DeepSetsparencitezaheerdeep architecture.
Under the restriction to analytic activation functions, we construct a symmetric function acting on sets of dimensions $N$ in dimension with $D$.
arXiv Detail & Related papers (2022-06-02T19:45:10Z) - Non-parametric Depth Distribution Modelling based Depth Inference for
Multi-view Stereo [43.415242967722804]
Recent cost volume pyramid based deep neural networks have unlocked the potential of efficiently leveraging high-resolution images for depth inference from multi-view stereo.
In general, those approaches assume that the depth of each pixel follows a unimodal distribution.
We propose constructing the cost volume by non-parametric depth distribution modeling to handle pixels with unimodal and multi-modal distributions.
arXiv Detail & Related papers (2022-05-08T05:13:04Z) - On the Generalization Mystery in Deep Learning [15.2292571922932]
We argue that the answer to two questions lies in the interaction of the gradients of different examples during training.
We formalize this argument with an easy to compute and interpretable metric for coherence.
The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others.
arXiv Detail & Related papers (2022-03-18T16:09:53Z) - Geometric Graph Representation Learning via Maximizing Rate Reduction [73.6044873825311]
Learning node representations benefits various downstream tasks in graph analysis such as community detection and node classification.
We propose Geometric Graph Representation Learning (G2R) to learn node representations in an unsupervised manner.
G2R maps nodes in distinct groups into different subspaces, while each subspace is compact and different subspaces are dispersed.
arXiv Detail & Related papers (2022-02-13T07:46:24Z) - Exponential Family Model-Based Reinforcement Learning via Score Matching [97.31477125728844]
We propose an optimistic model-based algorithm, dubbed SMRL, for finitehorizon episodic reinforcement learning (RL)
SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression.
arXiv Detail & Related papers (2021-12-28T15:51:07Z) - Exponential Convergence of Deep Operator Networks for Elliptic Partial
Differential Equations [0.0]
We construct deep operator networks (ONets) between infinite-dimensional spaces that emulate with an exponential rate of convergence the coefficient-to-solution map of elliptic second-order PDEs.
In particular, we consider problems set in $d$-dimensional periodic domains, $d=1, 2, dots$, and with analytic right-hand sides and coefficients.
We prove that the neural networks in the ONet have size $mathcalO(left|log(varepsilon)right|kappa)$ for some $kappa
arXiv Detail & Related papers (2021-12-15T13:56:28Z) - Unified Field Theory for Deep and Recurrent Neural Networks [56.735884560668985]
We present a unified and systematic derivation of the mean-field theory for both recurrent and deep networks.
We find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks.
Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$.
arXiv Detail & Related papers (2021-12-10T15:06:11Z) - On the Variance of the Fisher Information for Deep Learning [79.71410479830222]
The Fisher information matrix (FIM) has been applied to the realm of deep learning.
The exact FIM is either unavailable in closed form or too expensive to compute.
We investigate two such estimators based on two equivalent representations of the FIM.
arXiv Detail & Related papers (2021-07-09T04:46:50Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - On Function Approximation in Reinforcement Learning: Optimism in the
Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $mathcalF$ characterizes the complexity of the function.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z) - Representational aspects of depth and conditioning in normalizing flows [33.4333537858003]
We show that representationally the choice of partition is not a bottleneck for depth.
We also show that shallow affine coupling networks are universal approximators in Wasserstein distance.
arXiv Detail & Related papers (2020-10-02T18:15:45Z) - Better Depth-Width Trade-offs for Neural Networks through the lens of
Dynamical Systems [24.229336600210015]
Recently, depth separation results for ReLU networks were obtained via a new connection with dynamical systems.
We improve the existing width lower bounds along several aspects.
A byproduct of our results is that there exists a universal constant characterizing the depth-width trade-offs.
arXiv Detail & Related papers (2020-03-02T11:36:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.