Related papers: Embedding principle of homogeneous neural network for classification problem

Embedding principle of homogeneous neural network for classification problem

URL: http://arxiv.org/abs/2505.12419v3
Date: Thu, 23 Oct 2025 15:26:45 GMT
Title: Embedding principle of homogeneous neural network for classification problem
Authors: Jiahan Zhang, Yaoyu Zhang, Tao Luo,
Abstract summary: We study the Karush-Kuhn-Tucker (KKT) points of the associated maximum-margin problem in homogeneous neural networks.<n>We introduce and formalize the textbfKKT point embedding principle, establishing that KKT points of a homogeneous network's max-margin problem can be embedded into the KKT points of a larger network's problem.
Score: 8.954503565223478
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we study the Karush-Kuhn-Tucker (KKT) points of the associated maximum-margin problem in homogeneous neural networks, including fully-connected and convolutional neural networks. In particular, We investigates the relationship between such KKT points across networks of different widths generated. We introduce and formalize the \textbf{KKT point embedding principle}, establishing that KKT points of a homogeneous network's max-margin problem ($P_{\Phi}$) can be embedded into the KKT points of a larger network's problem ($P_{\tilde{\Phi}}$) via specific linear isometric transformations. We rigorously prove this principle holds for neuron splitting in fully-connected networks and channel splitting in convolutional neural networks. Furthermore, we connect this static embedding to the dynamics of gradient flow training with smooth losses. We demonstrate that trajectories initiated from appropriately mapped points remain mapped throughout training and that the resulting $\omega$-limit sets of directions are correspondingly mapped, thereby preserving the alignment with KKT directions dynamically when directional convergence occurs. We conduct several experiments to justify that trajectories are preserved. Our findings offer insights into the effects of network width, parameter redundancy, and the structural connections between solutions found via optimization in homogeneous networks of varying sizes.

Related papers

Exploring the loss landscape of regularized neural networks via convex duality [42.48510370193192]
We discuss several aspects of the loss landscape of regularized neural networks.<n>We first characterize the solution set of the convex problem using its dual and further characterize all stationary points.<n>We show that the solution set characterization and connectivity results can be extended to different architectures.
arXiv Detail & Related papers (2024-11-12T11:41:38Z)
Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks [19.185059111021854]
We study the implicit bias of the general family of steepest descent algorithms with infinitesimal learning rate in deep homogeneous neural networks.<n>We experimentally zoom into the trajectories of neural networks optimized with various steepest descent algorithms, highlighting connections to the implicit bias of popular adaptive methods.
arXiv Detail & Related papers (2024-10-29T14:28:49Z)
Directional Convergence Near Small Initializations and Saddles in Two-Homogeneous Neural Networks [1.9556053645976448]
This paper examines gradient flow dynamics of two-homogeneous neural networks for small initializations. For square loss, neural networks undergo saddle-to-saddle dynamics when close to the origin. Motivated by this, this paper also shows a similar directional convergence among weights of small magnitude in the neighborhood of certain saddle points.
arXiv Detail & Related papers (2024-02-14T15:10:37Z)
Feature Learning and Generalization in Deep Networks with Orthogonal Weights [1.7956122940209063]
Deep neural networks with numerically weights from independent Gaussian distributions can be tuned to criticality. These networks still exhibit fluctuations that grow linearly with the depth of the network. We show analytically that rectangular networks with tanh activations and weights from the ensemble of matrices have corresponding preactivation fluctuations.
arXiv Detail & Related papers (2023-10-11T18:00:02Z)
Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., achieving near-zero training loss with the increase in the number of learning epochs. We show that the threshold on the number of training samples increases with the increase in the network width.
arXiv Detail & Related papers (2023-09-12T13:03:47Z)
Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights. We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
Neural Inverse Kinematics [72.85330210991508]
Inverse kinematic (IK) methods recover the parameters of the joints, given the desired position of selected elements in the kinematic chain. We propose a neural IK method that employs the hierarchical structure of the problem to sequentially sample valid joint angles conditioned on the desired position.
arXiv Detail & Related papers (2022-05-22T14:44:07Z)
On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons. Our result may already hold for mild over- parameterization, where the width is $tildemathcalO(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural kernel (NTK) In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z)
Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent. We show that SGD is biased towards a simple solution. We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
The edge of chaos: quantum field theory and deep neural networks [0.0]
We explicitly construct the quantum field theory corresponding to a general class of deep neural networks. We compute the loop corrections to the correlation function in a perturbative expansion in the ratio of depth $T$ to width $N$. Our analysis provides a first-principles approach to the rapidly emerging NN-QFT correspondence, and opens several interesting avenues to the study of criticality in deep neural networks.
arXiv Detail & Related papers (2021-09-27T18:00:00Z)
Deep Networks Provably Classify Data on Curves [12.309532551321334]
We study a model problem that uses a deep fully-connected neural network to classify data drawn from two disjoint smooth curves on the unit sphere. We prove that when (i) the network depth is large to certain properties that set the difficulty of the problem and (ii) the network width and number of samples is intrinsic in the relative depth, randomly-d gradient descent quickly learns to correctly classify all points on the two curves with high probability.
arXiv Detail & Related papers (2021-07-29T20:40:04Z)
A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time. We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both. Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)
Optimizing Mode Connectivity via Neuron Alignment [84.26606622400423]
Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant. We propose a more general framework to investigate effect of symmetry on landscape connectivity by accounting for the weight permutations of networks being connected.
arXiv Detail & Related papers (2020-09-05T02:25:23Z)
Learning Connectivity of Neural Networks from a Topological Perspective [80.35103711638548]
We propose a topological perspective to represent a network into a complete graph for analysis. By assigning learnable parameters to the edges which reflect the magnitude of connections, the learning process can be performed in a differentiable manner. This learning process is compatible with existing networks and owns adaptability to larger search spaces and different tasks.
arXiv Detail & Related papers (2020-08-19T04:53:31Z)
Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
kernel methods outperform fully-connected finite-width networks. Centered and ensembled finite networks have reduced posterior variance. Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
Revealing the Structure of Deep Neural Networks via Convex Duality [70.15611146583068]
We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of hidden layers. We show that a set of optimal hidden layer weights for a norm regularized training problem can be explicitly found as the extreme points of a convex set. We apply the same characterization to deep ReLU networks with whitened data and prove the same weight alignment holds.
arXiv Detail & Related papers (2020-02-22T21:13:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.