Deep Networks Learn Deep Hierarchical Models
- URL: http://arxiv.org/abs/2601.00455v1
- Date: Thu, 01 Jan 2026 19:44:53 GMT
- Title: Deep Networks Learn Deep Hierarchical Models
- Authors: Amit Daniely
- Abstract summary: We consider supervised learning with $n$ labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. We argue that the mere existence of human ``teachers'' supports the hypothesis that hierarchical structures are inherently available.
- Score: 9.594432031144718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider supervised learning with $n$ labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class assumes the existence of an (unknown) label hierarchy $L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]$, where labels in $L_1$ are simple functions of the input, while for $i > 1$, labels in $L_i$ are simple functions of simpler labels. Our class surpasses models that were previously shown to be learnable by deep learning algorithms, in the sense that it reaches the depth limit of efficient learnability. That is, there are models in this class that require polynomial depth to express, whereas previous models can be computed by log-depth circuits. Furthermore, we suggest that learnability of such hierarchical models might eventually form a basis for understanding deep learning. Beyond their natural fit for domains where deep learning excels, we argue that the mere existence of human ``teachers'' supports the hypothesis that hierarchical structures are inherently available. By providing granular labels, teachers effectively reveal ``hints'' or ``snippets'' of the internal algorithms used by the brain. We formalize this intuition, showing that in a simplified model where a teacher is partially aware of their internal logic, a hierarchical structure emerges that facilitates efficient learnability.
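As a toy illustration of the model class (not the paper's construction): level-1 labels are simple functions of the input, and level-2 labels are simple functions of the level-1 labels. All dimensions, thresholds, and functions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # input dimension (illustrative)

# Level 1: "simple" labels computed directly from the input,
# here hypothetical linear-threshold functions.
W1 = rng.standard_normal((4, d))

def level1(x):
    return (W1 @ x > 0).astype(float)

# Level 2: labels that are simple functions of the simpler labels
# (ANDs/ORs of pairs -- again purely illustrative).
def level2(l1):
    return np.array([
        l1[0] * l1[1],        # AND of two simpler labels
        max(l1[2], l1[3]),    # OR of two simpler labels
    ])

x = rng.standard_normal(d)
l1 = level1(x)
print("L1 labels:", l1, "L2 labels:", level2(l1))
```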
Related papers
- Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining [58.69016084278948]
We consider a hierarchical context-free grammar introduced by arXiv:2307.02129 and conjectured to separate deep and shallow networks. We prove that, under mild conditions, a deep convolutional network can be efficiently trained to learn this function class.
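A toy generator in the spirit of such a hierarchical grammar (vocabulary size, branching factor, and depth below are made up, not the parameters of arXiv:2307.02129): each symbol expands, via a fixed random production rule, into a tuple of lower-level symbols, and the class label is the top-level symbol.

```python
import random

random.seed(0)
v, s, L = 4, 2, 3  # vocabulary size, branching factor, depth (illustrative)

# One random production rule per (level, symbol): symbol -> tuple of s symbols.
rules = {
    (lvl, sym): tuple(random.randrange(v) for _ in range(s))
    for lvl in range(L) for sym in range(v)
}

def expand(sym, lvl):
    """Recursively expand a symbol into its string of leaf symbols."""
    if lvl == L:
        return [sym]
    out = []
    for child in rules[(lvl, sym)]:
        out.extend(expand(child, lvl + 1))
    return out

label = 0                       # class label = top-level symbol
print(label, expand(label, 0))  # leaf string of length s**L
```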
arXiv Detail & Related papers (2026-01-27T16:19:54Z) - Learning with the $p$-adics [26.431600220740354]
We study the suitability of a radically different field as an alternative to $\mathbb{R}$: the ultrametric and non-archimedean space of $p$-adic numbers, $\mathbb{Q}_p$. The hierarchical structure of the $p$-adics and their interpretation as infinite strings make them an appealing tool for coding theory and hierarchical representation learning.
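For concreteness, a small self-contained sketch of the $p$-adic valuation and norm that give $\mathbb{Q}_p$ its ultrametric structure (standard definitions, not code from the paper):

```python
from fractions import Fraction

def vp(x: Fraction, p: int) -> int:
    """p-adic valuation: the power of p dividing x (for x != 0)."""
    assert x != 0
    num, den, v = x.numerator, x.denominator, 0
    while num % p == 0:
        num //= p
        v += 1
    while den % p == 0:
        den //= p
        v -= 1
    return v

def padic_norm(x: Fraction, p: int) -> Fraction:
    """|x|_p = p**(-vp(x)): numbers divisible by high powers of p are 'small'."""
    if x == 0:
        return Fraction(0)
    return Fraction(1, p) ** vp(x, p)

# Ultrametric inequality: |x + y|_p <= max(|x|_p, |y|_p)
x, y, p = Fraction(9), Fraction(18), 3
print(padic_norm(x, p), padic_norm(y, p), padic_norm(x + y, p))
```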
arXiv Detail & Related papers (2025-12-27T19:40:42Z) - Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit [66.20349460098275]
We study the gradient descent learning of a general Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U} \in \mathbb{R}^{r \times d}$. We prove that under generic non-degeneracy assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error.
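A minimal sketch of generating data from such a multi-index model, with a hypothetical link function $g$ and illustrative dimensions (the training procedure itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 64, 2, 10_000  # ambient dim, hidden dim, sample count (illustrative)

# Hidden subspace: U in R^{r x d} with orthonormal rows.
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
U = Q.T

def g(z):
    # A hypothetical link function on the r-dimensional projection.
    return np.tanh(z[:, 0]) + z[:, 0] * z[:, 1]

X = rng.standard_normal((n, d))  # Gaussian inputs
y = g(X @ U.T)                   # labels depend on x only through Ux
print(X.shape, y.shape)
```

Per the abstract, a two-layer network trained layer-wise on such `(X, y)` pairs recovers the target up to $o_d(1)$ test error under the paper's assumptions.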
arXiv Detail & Related papers (2025-11-19T04:46:47Z) - Beyond Softmax: A Natural Parameterization for Categorical Random Variables [61.709831225296305]
We introduce the $\textit{catnat}$ function, composed of a sequence of hierarchical binary splits. A rich set of experiments shows that the proposed function improves learning efficiency and yields models with consistently higher test performance.
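One plausible reading of "a sequence of hierarchical binary splits", sketched for four categories (the exact parameterization in the paper may differ):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def catnat(logits):
    """Map 3 binary-split logits to a distribution over 4 categories via a
    complete binary tree: p(leaf) = product of branch probabilities on its
    root-to-leaf path. Illustrative reading of 'hierarchical binary splits'."""
    a, b, c = sigmoid(logits)  # root split, left-child split, right-child split
    return np.array([a * b, a * (1 - b), (1 - a) * c, (1 - a) * (1 - c)])

p = catnat(np.array([0.3, -1.2, 2.0]))
print(p, p.sum())  # a valid categorical distribution: sums to 1
```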
arXiv Detail & Related papers (2025-09-29T12:55:50Z) - Understanding Aggregations of Proper Learners in Multiclass Classification [4.422219522591412]
Multiclass learnability is known to exhibit a properness barrier. Recent advances in binary classification have demonstrated that this requirement can be satisfied using aggregations of proper learners. We show that a single ERM requires $\Omega\left(\frac{d_G \ln(1/\delta)}{\epsilon}\right)$ samples.
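A schematic of aggregating proper learners by majority vote, the kind of improper aggregate the abstract alludes to; the base learner and data here are stand-ins:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def erm_1nn(X, y):
    """A stand-in 'proper' base learner (1-nearest-neighbour as hypothetical ERM)."""
    def h(x):
        return y[np.argmin(np.linalg.norm(X - x, axis=1))]
    return h

def majority_aggregate(learners):
    """Aggregate proper learners by majority vote; the aggregate itself
    need not lie in the original hypothesis class (it is improper)."""
    def h(x):
        return Counter(f(x) for f in learners).most_common(1)[0][0]
    return h

X, y = rng.standard_normal((60, 2)), rng.integers(0, 3, 60)
splits = np.array_split(rng.permutation(60), 3)
agg = majority_aggregate([erm_1nn(X[s], y[s]) for s in splits])
print(agg(np.zeros(2)))
```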
arXiv Detail & Related papers (2024-10-30T07:12:02Z) - When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models [61.363259848264725]
Inheritune is a simple and effective training recipe for building smaller, more efficient language models. We show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts.
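A hypothetical reading of such a layer-inheritance recipe, sketched with generic PyTorch transformer layers (this is not the paper's released code):

```python
import copy
import torch.nn as nn

def make_stack(n_layers, d_model=256, n_heads=4):
    # A generic transformer stack standing in for a language model's layers.
    return nn.ModuleList(
        nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        for _ in range(n_layers)
    )

big = make_stack(12)  # stands in for a larger, already-trained model
k = 3                 # number of early layers to inherit (hypothetical)

# Initialize the smaller model with copies of the first k layers of the
# larger one; training the small model would then continue from here.
small = nn.ModuleList(copy.deepcopy(big[i]) for i in range(k))
print(len(big), "->", len(small), "layers inherited")
```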
arXiv Detail & Related papers (2024-04-12T17:53:34Z) - Logical Entity Representation in Knowledge-Graphs for Differentiable Rule Learning [71.05093203007357]
We propose Logical Entity RePresentation (LERP) to encode contextual information of entities in the knowledge graph.
A LERP is designed as a vector of probabilistic logical functions on the entity's neighboring sub-graph.
Our model outperforms other rule learning methods in knowledge graph completion and is comparable or even superior to state-of-the-art black-box methods.
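A loose, illustrative analogue of such a representation on a toy graph: each coordinate is the truth value of a logical pattern over the entity's neighbouring sub-graph (the paper uses probabilistic, learnable functions; everything below is made up):

```python
import numpy as np

# Toy knowledge graph: (head, relation, tail) triples (purely illustrative).
triples = [("ann", "parent_of", "bob"), ("bob", "parent_of", "cal"),
           ("ann", "lives_in", "oslo")]
relations = ["parent_of", "lives_in"]

def lerp_like(entity):
    """A LERP-style vector: one coordinate per logical pattern on the
    entity's neighbourhood. Here: hard 0/1 truth values of hypothetical
    patterns 'entity has an outgoing/incoming `rel` edge'."""
    return np.array(
        [float(any(h == entity and r == rel for h, r, _ in triples))
         for rel in relations] +     # outgoing-edge patterns
        [float(any(t == entity and r == rel for _, r, t in triples))
         for rel in relations]       # incoming-edge patterns
    )

print(lerp_like("bob"))  # -> [1. 0. 1. 0.]
```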
arXiv Detail & Related papers (2023-05-22T05:59:22Z) - Self-Attention Networks Can Process Bounded Hierarchical Languages [24.75432474021856]
We prove that self-attention networks can process $\mathsf{Dyck}_{k,D}$, the subset of $\mathsf{Dyck}_k$ with depth bounded by $D$.
Experiments show that self-attention networks trained on $\mathsf{Dyck}_{k,D}$ generalize to longer inputs with near-perfect accuracy.
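For reference, a stack-based recognizer for $\mathsf{Dyck}_{k,D}$, the language in question (standard construction, independent of the paper):

```python
def in_dyck(s, k, D):
    """Recognize Dyck_{k,D}: balanced strings over k bracket types with
    nesting depth at most D."""
    stack = []
    for typ, is_open in s:       # token = (bracket type, is it an opener?)
        if typ >= k:
            return False         # unknown bracket type
        if is_open:
            stack.append(typ)
            if len(stack) > D:   # nesting depth exceeds the bound
                return False
        elif not stack or stack.pop() != typ:
            return False         # mismatched or unbalanced closer
    return not stack

# "([])" with k=2 bracket types; its nesting depth is 2
word = [(0, True), (1, True), (1, False), (0, False)]
print(in_dyck(word, k=2, D=2))  # True
print(in_dyck(word, k=2, D=1))  # False: depth bound violated
```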
arXiv Detail & Related papers (2021-05-24T06:42:58Z) - Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning efficiently and automatically by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
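A minimal sketch of the setting: when lower and higher layers are trained together end-to-end, gradients through the higher layers keep adjusting the lower-level features (target function, sizes, and schedule below are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(),   # lower-level features
                    nn.Linear(16, 16), nn.ReLU(),  # higher-level features
                    nn.Linear(16, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.1)

X = torch.randn(256, 8)
y = (X[:, :1] * X[:, 1:2]).tanh()  # hypothetical hierarchical target

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(X), y)
    loss.backward()  # gradients flow through the higher layers back into
    opt.step()       # the first layer, continuing to refine its features
print(float(loss))
```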
arXiv Detail & Related papers (2020-01-13T17:28:29Z)