A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond
- URL: http://arxiv.org/abs/2510.19382v1
- Date: Wed, 22 Oct 2025 08:55:00 GMT
- Title: A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond
- Authors: Nikos Tsikouras, Yorgos Pantis, Ioannis Mitliagkas, Christos Tzamos
- Abstract summary: We focus on the structure discovery aspect and study it under weaker assumptions. At the core of our approach is a key $\textit{derandomization}$ lemma. This lemma directly explains structure discovery and has immediate applications in other domains.
- Score: 25.592330047318274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the dynamics of feature learning in neural networks (NNs) remains a significant challenge. The work of (Mousavi-Hosseini et al., 2023) analyzes a multiple index teacher-student setting and shows that a two-layer student attains a low-rank structure in its first-layer weights when trained with stochastic gradient descent (SGD) and a strong regularizer. This structural property is known to reduce the sample complexity of generalization. Indeed, in a second step, the same authors establish algorithm-specific learning guarantees under additional assumptions. In this paper, we focus exclusively on the structure discovery aspect and study it under weaker assumptions, more specifically: we allow (a) NNs of arbitrary size and depth, (b) with all parameters trainable, (c) under any smooth loss function, (d) tiny regularization, and (e) trained by any method that attains a second-order stationary point (SOSP), e.g., perturbed gradient descent (PGD). At the core of our approach is a key $\textit{derandomization}$ lemma, which states that optimizing the function $\mathbb{E}_{\mathbf{x}} \left[g_{\theta}(\mathbf{W}\mathbf{x} + \mathbf{b})\right]$ converges to a point where $\mathbf{W} = \mathbf{0}$, under mild conditions. The fundamental nature of this lemma directly explains structure discovery and has immediate applications in other domains, including an end-to-end approximation for MAXCUT, and computing Johnson-Lindenstrauss embeddings.
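For intuition, the derandomization claim can be checked numerically on a toy instance. The sketch below is not from the paper: the choice of $g$ (a smooth convex pseudo-Huber sum), the Gaussian data, the tiny $\ell_2$ penalty, and the noisy gradient updates standing in for PGD are all illustrative assumptions. Running it, the Frobenius norm of $\mathbf{W}$ decays toward zero, in line with the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 8, 4, 4096                      # input dim, rows of W, Monte Carlo samples
X = rng.standard_normal((n, d))           # draws of x ~ N(0, I) (illustrative choice)

def grads(W, b, lam=1e-4, noise=1e-3):
    """Gradients of (1/n) * sum_i g(W x_i + b) + lam * ||W||_F^2 for the smooth convex
    test function g(z) = sum_j sqrt(1 + z_j^2), plus a small random perturbation so the
    update loosely mimics perturbed gradient descent (PGD)."""
    Z = X @ W.T + b                       # (n, m) pre-activations W x + b
    G = Z / np.sqrt(1.0 + Z ** 2)         # dg/dZ, elementwise
    dW = G.T @ X / n + 2 * lam * W + noise * rng.standard_normal(W.shape)
    db = G.mean(axis=0) + noise * rng.standard_normal(b.shape)
    return dW, db

W = 0.5 * rng.standard_normal((m, d))
b = 0.5 * rng.standard_normal(m)
for _ in range(3000):
    dW, db = grads(W, b)
    W -= 0.1 * dW
    b -= 0.1 * db

print("final ||W||_F:", np.linalg.norm(W))  # shrinks toward 0, as the lemma predicts
```

Because $g$ is convex and $(\mathbf{W}, \mathbf{b}) \mapsto \mathbf{W}\mathbf{x} + \mathbf{b}$ is affine, the expected objective here is convex, so plain gradient descent already reaches the global minimizer with $\mathbf{W} = \mathbf{0}$; the paper's lemma concerns the more general second-order stationary points.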
Related papers
- Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit [66.20349460098275]
We study the gradient descent learning of a general Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U} \in \mathbb{R}^{r \times d}$. We prove that under generic non-degeneracy assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error.
arXiv Detail & Related papers (2025-11-19T04:46:47Z) - Joint Learning in the Gaussian Single Index Model [6.3151583550712065]
We consider the problem of jointly learning a one-dimensional projection and a univariate function in high-dimensional Gaussian models. Our analysis shows that convergence still occurs even when the initial direction is negatively correlated with the target. On the practical side, we demonstrate that such joint learning can be effectively implemented using a reproducing kernel Hilbert space (RKHS) adapted to the structure of the problem.
arXiv Detail & Related papers (2025-05-27T15:30:34Z) - Enhanced Feature Learning via Regularisation: Integrating Neural Networks and Kernel Methods [0.0]
We consider functions that are expectations of Sobolev functions over all possible one-dimensional projections of the data. This framework is similar to kernel ridge regression, where the kernel is $\mathbb{E}_w\left[k^{(B)}(w^\top x, w^\top x')\right]$, with $k^{(B)}(a,b) := \min(|a|, |b|)\,\mathds{1}_{ab>0}$ the Brownian kernel, and the distribution of the projections $w$ is learnt (a Monte Carlo sketch of this kernel appears after this list).
arXiv Detail & Related papers (2024-07-24T13:46:50Z) - Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early Stopping [15.975065054204753]
We study nonparametric regression using an over-parameterized two-layer neural network trained with algorithmic guarantees. We demonstrate that training the neural network with a novel Preconditioned Gradient Descent (PGD) algorithm, equipped with early stopping, achieves a sharp regression rate.
arXiv Detail & Related papers (2024-07-16T03:38:34Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - A duality framework for analyzing random feature and two-layer neural networks [7.400520323325074]
We consider the problem of learning functions within the $\mathcal{F}_{p,\pi}$ and Barron spaces. We establish a dual equivalence between approximation and estimation, and then apply it to study the learning of the preceding function spaces.
arXiv Detail & Related papers (2023-05-09T17:41:50Z) - Stochastic Zeroth order Descent with Structured Directions [10.604744518360464]
We introduce and analyze Structured Stochastic Zeroth order Descent (SSZD), a finite difference approach that approximates a gradient on a set of $l \leq d$ directions, where $d$ is the dimension of the ambient space (a generic sketch of such an estimator appears after this list). For convex functions we prove almost sure convergence and a rate on the function values of $O((d/l)\, k^{-c})$ for every $c < 1/2$, which is arbitrarily close to that of stochastic gradient descent (SGD) in terms of the number of iterations.
arXiv Detail & Related papers (2022-06-10T14:00:06Z) - A framework for overparameterized learning [0.0]
An explanation for the success of deep neural networks is a central question in theoretical machine learning.
We propose a framework consisting of a prototype learning problem, which is general enough to cover many popular problems.
We then demonstrate that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem.
arXiv Detail & Related papers (2022-05-26T17:17:46Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Efficient Methods for Structured Nonconvex-Nonconcave Min-Max Optimization [98.0595480384208]
We propose a generalization of the extragradient method which converges to a stationary point.
The algorithm applies not only to Euclidean spaces, but also to general $\ell_p$-normed finite-dimensional vector spaces.
arXiv Detail & Related papers (2020-10-31T21:35:42Z) - Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z)
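As a concrete illustration of the averaged Brownian kernel in the "Enhanced Feature Learning via Regularisation" entry above, the sketch below estimates $\mathbb{E}_w[k^{(B)}(w^\top x, w^\top x')]$ by Monte Carlo. Sampling $w$ uniformly from the unit sphere is an illustrative assumption; the cited paper learns the distribution of the projections.

```python
import numpy as np

def brownian_kernel(a, b):
    # k^(B)(a, b) = min(|a|, |b|) * 1[ab > 0], applied elementwise.
    return np.minimum(np.abs(a), np.abs(b)) * (a * b > 0)

def expected_brownian_kernel(x, x_prime, num_dirs=20000, rng=None):
    """Monte Carlo estimate of E_w[k^(B)(w^T x, w^T x')] over random projection
    directions w (uniform on the unit sphere here; an illustrative choice)."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    W = rng.standard_normal((num_dirs, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # normalize rows to unit directions
    return brownian_kernel(W @ x, W @ x_prime).mean()

x = np.array([1.0, 0.0, 0.5])
x_prime = np.array([0.8, 0.3, 0.4])
print(expected_brownian_kernel(x, x_prime))
```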
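The "Stochastic Zeroth order Descent with Structured Directions" entry above describes finite-difference gradient estimates built from $l \leq d$ directions. The sketch below is a generic structured zeroth-order estimator in that spirit, not the paper's exact algorithm: the orthonormal direction sampling and the $d/l$ rescaling are assumptions chosen so the estimate is unbiased for linear functions.

```python
import numpy as np

def structured_zeroth_order_grad(f, x, l, h=1e-5, rng=None):
    """Central finite-difference gradient estimate of f at x along l <= d random
    orthonormal directions; the d/l factor makes the estimate unbiased for linear f."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((d, l)))   # l orthonormal directions (columns)
    g = np.zeros(d)
    for i in range(l):
        p = Q[:, i]
        g += (f(x + h * p) - f(x - h * p)) / (2 * h) * p
    return (d / l) * g

# Example: estimate the gradient of a simple quadratic using l = 3 of d = 10 directions.
f = lambda x: 0.5 * np.dot(x, x)
x0 = np.arange(10, dtype=float)
print(structured_zeroth_order_grad(f, x0, l=3))   # noisy estimate of the true gradient x0
```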