Hamiltonian Mechanics of Feature Learning: Bottleneck Structure in Leaky ResNets
- URL: http://arxiv.org/abs/2405.17573v1
- Date: Mon, 27 May 2024 18:15:05 GMT
- Title: Hamiltonian Mechanics of Feature Learning: Bottleneck Structure in Leaky ResNets
- Authors: Arthur Jacot, Alexandre Kaiser
- Abstract summary: We study Leaky ResNets, which interpolate between ResNets ($\tilde{L}=0$) and Fully-Connected nets ($\tilde{L}\to\infty$) depending on an 'effective depth' hyper-parameter $\tilde{L}$.
In the infinite depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs).
We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work.
- Score: 58.460298576330835
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We study Leaky ResNets, which interpolate between ResNets ($\tilde{L}=0$) and Fully-Connected nets ($\tilde{L}\to\infty$) depending on an 'effective depth' hyper-parameter $\tilde{L}$. In the infinite depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs) from input $p=0$ to output $p=1$ that minimize the parameter norm of the network. We give a Lagrangian and Hamiltonian reformulation, which highlights the importance of two terms: a kinetic energy which favors small layer derivatives $\partial_{p}A_{p}$ and a potential energy that favors low-dimensional representations, as measured by the 'Cost of Identity'. The balance between these two forces offers an intuitive understanding of feature learning in ResNets. We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work: for large $\tilde{L}$ the potential energy dominates and leads to a separation of timescales, where the representation jumps rapidly from the high-dimensional inputs to a low-dimensional representation, moves slowly inside the space of low-dimensional representations, and then jumps back to the potentially high-dimensional outputs. Inspired by this phenomenon, we train with an adaptive layer step-size to adapt to the separation of timescales.
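The following is a minimal illustrative sketch (not the authors' code) of the picture described in the abstract. It assumes a discretized leaky-residual update in which a hypothetical hyper-parameter `L_tilde` (standing in for the effective depth $\tilde{L}$) leaks away the skip connection: with `L_tilde = 0` each step is a plain ResNet update, while large `L_tilde` damps the identity branch so the map behaves more like a stack of fully-connected layers. The helper functions `kinetic_energy` and `effective_rank` are only rough proxies for the abstract's kinetic term (small layer derivatives) and the dimensionality that the potential term penalizes; the exact functional forms are assumptions, not the paper's definitions.

```python
# Hedged sketch of a "leaky" residual forward pass and the two competing energies.
import numpy as np

rng = np.random.default_rng(0)

def leaky_resnet_forward(A0, weights, L_tilde):
    """Run the representation A_p through len(weights) leaky residual steps."""
    L = len(weights)      # number of discretization steps over p in [0, 1]
    dp = 1.0 / L          # uniform layer step-size (the paper adapts this)
    A = A0
    path = [A]
    for W in weights:
        # residual update with a leak of strength L_tilde on the identity branch
        A = A + dp * (np.tanh(A @ W) - L_tilde * A)
        path.append(A)
    return path

def kinetic_energy(path):
    """Discrete stand-in for the integral of ||dA/dp||^2 along the path."""
    L = len(path) - 1
    return sum(np.sum((path[i + 1] - path[i]) ** 2) * L for i in range(L))

def effective_rank(A, tol=1e-3):
    """Crude proxy for representation dimension: count of non-negligible singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# toy run: 32 samples, width 16, 20 layers
A0 = rng.standard_normal((32, 16))
weights = [rng.standard_normal((16, 16)) / np.sqrt(16) for _ in range(20)]

for L_tilde in (0.0, 5.0):
    path = leaky_resnet_forward(A0, weights, L_tilde)
    ranks = [effective_rank(A) for A in path]
    print(f"L_tilde={L_tilde}: kinetic={kinetic_energy(path):.2f}, ranks along depth={ranks}")
```

In this toy setup, tracking the effective rank along depth is meant to mirror the bottleneck narrative: a stronger leak pushes the representation toward lower-dimensional intermediate layers, at the cost of larger layer derivatives (kinetic energy) near the input and output.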
Related papers
- Mathematical Models of Computation in Superposition [0.9374652839580183]
Superposition poses a serious challenge to mechanistically interpreting current AI systems.
We present mathematical models of computation in superposition, where superposition is actively helpful for efficiently accomplishing the task.
We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition.
arXiv Detail & Related papers (2024-08-10T06:11:48Z) - "Lossless" Compression of Deep Neural Networks: A High-dimensional
Neural Tangent Kernel Approach [49.744093838327615]
We provide a novel compression approach to wide and fully-connected deep neural nets.
Experiments on both synthetic and real-world data are conducted to support the advantages of the proposed compression scheme.
arXiv Detail & Related papers (2024-03-01T03:46:28Z) - Super Consistency of Neural Network Landscapes and Learning Rate Transfer [72.54450821671624]
We study the landscape through the lens of the loss Hessian.
We find that certain spectral properties under $\mu$P are largely independent of the size of the network.
We show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales.
arXiv Detail & Related papers (2024-02-27T12:28:01Z) - Capacity Bounds for Hyperbolic Neural Network Representations of Latent
Tree Structures [8.28720658988688]
We study the representation capacity of deep hyperbolic neural networks (HNNs) with a ReLU activation function.
We establish the first proof that HNNs can embed any finite weighted tree into a hyperbolic space of dimension $d$ at least equal to $2$.
We find that the network complexity of HNN implementing the graph representation is independent of the representation fidelity/distortion.
arXiv Detail & Related papers (2023-08-18T02:24:32Z) - Polynomial Width is Sufficient for Set Representation with
High-dimensional Features [69.65698500919869]
DeepSets is the most widely used neural network architecture for set representation.
We present two set-element embedding layers: (a) linear + power activation (LP) and (b) linear + exponential activations (LE)
arXiv Detail & Related papers (2023-07-08T16:00:59Z) - The extended star graph as a light-harvesting-complex prototype:
excitonic absorption speedup by peripheral energy defect tuning [0.0]
We study the quantum dynamics of a photo-excitation uniformly distributed at the periphery of an extended star network.
We show that the origin of this speedup takes place in the hybridization of two upper-band excitonic eigenstates.
arXiv Detail & Related papers (2022-10-14T21:21:07Z) - Understanding Deep Neural Function Approximation in Reinforcement
Learning via $\epsilon$-Greedy Exploration [53.90873926758026]
This paper provides a theoretical study of deep neural function approximation in reinforcement learning (RL)
We focus on the value based algorithm with the $\epsilon$-greedy exploration via deep (and two-layer) neural networks endowed by Besov (and Barron) function spaces.
Our analysis reformulates the temporal difference error in an $L^2(\mathrm{d}\mu)$-integrable space over a certain averaged measure $\mu$, and transforms it to a generalization problem under the non-iid setting.
arXiv Detail & Related papers (2022-09-15T15:42:47Z) - On the Banach spaces associated with multi-layer ReLU networks: Function
representation, approximation theory and gradient descent dynamics [8.160343645537106]
We develop Banach spaces for ReLU neural networks of finite depth $L$ and infinite width.
The spaces contain all finite fully connected $L$-layer networks and their $L^2$-limiting objects under the natural path-norm.
Under this norm, the unit ball in the space for $L$-layer networks has low Rademacher complexity and thus favorable properties.
arXiv Detail & Related papers (2020-07-30T17:47:05Z) - Better Depth-Width Trade-offs for Neural Networks through the lens of
Dynamical Systems [24.229336600210015]
Recently, depth separation results for ReLU networks were obtained via a new connection with dynamical systems.
We improve the existing width lower bounds along several aspects.
A byproduct of our results is that there exists a universal constant characterizing the depth-width trade-offs.
arXiv Detail & Related papers (2020-03-02T11:36:26Z) - Anisotropy-mediated reentrant localization [62.997667081978825]
We consider a 2d dipolar system, $d=2$, with the generalized dipole-dipole interaction $\sim r^{-a}$, and the power $a$ controlled experimentally in trapped-ion or Rydberg-atom systems.
We show that the spatially homogeneous tilt $\beta$ of the dipoles giving rise to the anisotropic dipole exchange leads to the non-trivial reentrant localization beyond the locator expansion.
arXiv Detail & Related papers (2020-01-31T19:00:01Z)