Hierarchical Zero-Order Optimization for Deep Neural Networks
- URL: http://arxiv.org/abs/2602.10607v1
- Date: Wed, 11 Feb 2026 07:56:07 GMT
- Title: Hierarchical Zero-Order Optimization for Deep Neural Networks
- Authors: Sansheng Cao, Zhengyu Ma, Yonghong Tian
- Abstract summary: We propose Hierarchical Zeroth-Order (HZO) optimization, a novel divide-and-conquer strategy that decomposes the depth dimension of the network. We prove that HZO reduces the query complexity from $O(ML^2)$ to $O(ML \log L)$ for a network of width $M$ and depth $L$, representing a significant leap over existing ZO methodologies.
- Score: 33.991611257471114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zeroth-order (ZO) optimization has long been favored for its biological plausibility and its capacity to handle non-differentiable objectives, yet its computational complexity has historically limited its application in deep neural networks. Challenging the conventional paradigm that gradients propagate layer-by-layer, we propose Hierarchical Zeroth-Order (HZO) optimization, a novel divide-and-conquer strategy that decomposes the depth dimension of the network. We prove that HZO reduces the query complexity from $O(ML^2)$ to $O(ML \log L)$ for a network of width $M$ and depth $L$, representing a significant leap over existing ZO methodologies. Furthermore, we provide a detailed error analysis showing that HZO maintains numerical stability by operating near the unitary limit ($L_{lip} \approx 1$). Extensive evaluations on CIFAR-10 and ImageNet demonstrate that HZO achieves competitive accuracy compared to backpropagation.
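The abstract gives no implementation detail, so the following is only a minimal sketch of the two ingredients it names: a two-point zeroth-order gradient estimator and a divide-and-conquer split over depth. All function names are hypothetical, and the query reuse across levels that yields the paper's $O(ML \log L)$ bound is not reproduced here.
```python
import numpy as np

def two_point_zo_grad(loss_fn, theta, mu=1e-3, rng=None):
    """SPSA-style two-point ZO estimate: probe the loss along a random
    direction u and return ((f(theta+mu*u) - f(theta-mu*u)) / 2mu) * u."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)
    return (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu) * u

def hzo_step(loss_fn, layers, lr=1e-2, rng=None):
    """Divide-and-conquer sketch: recursively halve the depth and estimate
    a separate ZO gradient per leaf block, rather than perturbing all L
    layers jointly."""
    rng = rng or np.random.default_rng()
    def recurse(lo, hi):
        if hi - lo == 1:
            i = lo
            def block_loss(p):
                old, layers[i] = layers[i], p   # swap in the probe point
                val = loss_fn(layers)
                layers[i] = old
                return val
            layers[i] = layers[i] - lr * two_point_zo_grad(block_loss, layers[i], rng=rng)
        else:
            mid = (lo + hi) // 2
            recurse(lo, mid)
            recurse(mid, hi)
    recurse(0, len(layers))
    return layers
```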
Related papers
- LDLT $\mathcal{L}$-Lipschitz Network: Generalized Deep End-To-End Lipschitz Network Construction [3.744861320984297]
Controlling the Lipschitz constant in neural networks has emerged as an essential area of research to enhance adversarial robustness and network certifiability. This paper presents a rigorous approach to the general design of $\mathcal{L}$-Lipschitz deep residual networks using a Linear Matrix Inequality (LMI) framework.
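The LMI machinery itself is beyond a short snippet, but the naive certificate such constructions tighten, the product of layer spectral norms for a feed-forward net with 1-Lipschitz activations, can be sketched (hypothetical helper, not the paper's method):
```python
import numpy as np

def lipschitz_upper_bound(weights):
    """Naive Lipschitz certificate: product of layer spectral norms,
    assuming 1-Lipschitz activations between the linear layers."""
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, ord=2)  # largest singular value
    return bound

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) / np.sqrt(64) for _ in range(4)]
print(lipschitz_upper_bound(layers))
```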
arXiv Detail & Related papers (2025-12-05T17:51:08Z)
- Lighter-X: An Efficient and Plug-and-play Strategy for Graph-based Recommendation through Decoupled Propagation [49.865020394064096]
We propose Lighter-X, an efficient and modular framework that can be seamlessly integrated with existing GNN-based recommender architectures. Our approach substantially reduces both parameter size and computational complexity while preserving the theoretical guarantees and empirical performance of the base models. Experiments demonstrate that Lighter-X achieves comparable performance to baseline models with significantly fewer parameters.
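As a hedged illustration of decoupled propagation (not the Lighter-X code), one can run the normalized graph propagation once, offline, so that only a lightweight head remains trainable:
```python
import numpy as np

def precompute_propagation(adj, feats, k=3):
    """Run symmetric-normalized propagation k times over the features
    once, so training later touches only a small trainable scorer."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    norm_adj = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    h = feats
    for _ in range(k):
        h = norm_adj @ h
    return h  # fixed input to the lightweight trainable head
```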
arXiv Detail & Related papers (2025-10-11T08:33:08Z)
- Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized federated learning (DFL) eliminates reliance on the client-server architecture. Non-smooth regularization is often incorporated into machine learning tasks. We propose a novel DNCFL algorithm to solve these problems.
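A minimal sketch of the generic gradient-tracking primitive such methods build on, assuming a doubly stochastic mixing matrix; this is not the paper's DNCFL algorithm, which adds momentum and composite (non-smooth) terms:
```python
import numpy as np

def gradient_tracking_step(x, y, grads_prev, grads_new, mixing, lr=0.1):
    """One decentralized gradient-tracking round.
    x: (n_nodes, dim) local iterates, y: (n_nodes, dim) gradient trackers,
    mixing: (n_nodes, n_nodes) doubly stochastic gossip matrix."""
    x_next = mixing @ x - lr * y
    y_next = mixing @ y + (grads_new - grads_prev)  # track the average gradient
    return x_next, y_next
```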
arXiv Detail & Related papers (2025-04-17T08:32:25Z)
- Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. A key obstacle is the cumulative effect of reconstruction errors throughout the sparsification process. We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
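For illustration only: once per-layer sparsity rates are decided (the paper derives them theoretically; here they are simply given), applying them reduces to per-layer magnitude pruning:
```python
import numpy as np

def prune_layerwise(weights, sparsities):
    """Zero out the smallest-magnitude fraction s of each layer's weights."""
    pruned = []
    for W, s in zip(weights, sparsities):
        k = int(s * W.size)
        if k > 0:
            thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
            W = np.where(np.abs(W) <= thresh, 0.0, W)
        pruned.append(W)
    return pruned
```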
arXiv Detail & Related papers (2025-02-20T17:51:10Z)
- Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation [8.35644084613785]
We introduce the maximal update parameterization ($\mu$P) in the infinite-width limit for two representative designs of local targets. By analyzing deep linear networks, we found that PC's gradients interpolate between first-order and Gauss-Newton-like gradients. We demonstrate that, in specific standard settings, PC in the infinite-width limit behaves more similarly to the first-order gradient.
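A minimal predictive-coding sketch for a linear network, showing the local target relaxation the parameterization analysis concerns; the $\mu$P width scaling itself is omitted and all names are hypothetical:
```python
import numpy as np

def pc_inference(weights, x_in, target, steps=20, gamma=0.1):
    """Relax activations to minimize local prediction errors
    eps_l = x_{l+1} - W_l x_l, with the output layer clamped to the target."""
    xs = [x_in]
    for W in weights:
        xs.append(W @ xs[-1])
    xs[-1] = target  # clamp output
    for _ in range(steps):
        eps = [xs[l + 1] - weights[l] @ xs[l] for l in range(len(weights))]
        for l in range(1, len(xs) - 1):
            # local gradient: own error minus back-projected next error
            xs[l] -= gamma * (eps[l - 1] - weights[l].T @ eps[l])
    return xs  # relaxed activations serve as local targets for weight updates
```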
arXiv Detail & Related papers (2024-11-04T11:38:27Z)
- Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis [16.893624100273108]
We analyze recurrent neural networks with diagonal hidden-to-hidden weight matrices trained with gradient descent in the supervised learning setting.
We prove that gradient descent can achieve optimality without massive overparameterization.
Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks.
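The analyzed model class is easy to state in code; a sketch of a linear RNN with a diagonal hidden-to-hidden matrix (hypothetical names; the supervised loss and the gradient-descent loop are omitted):
```python
import numpy as np

def diag_rnn_forward(d, u, w, xs):
    """Recurrent net with a *diagonal* hidden-to-hidden matrix, the setting
    the analysis covers: h_t = diag(d) h_{t-1} + u * x_t, output y = w . h_T."""
    h = np.zeros_like(d)
    for x_t in xs:           # xs: sequence of scalar inputs
        h = d * h + u * x_t
    return w @ h
```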
arXiv Detail & Related papers (2024-02-19T15:56:43Z)
- Functional SDE approximation inspired by a deep operator network architecture [0.0]
A novel approach to approximate solutions of Stochastic Differential Equations (SDEs) by Deep Neural Networks is derived and analysed. The architecture is inspired by the notion of Deep Operator Networks (DeepONets), which is based on operator learning in terms of a reduced basis also represented in the network. The proposed SDEONet architecture aims to alleviate the issue of exponential complexity by learning an optimal sparse truncation of the Wiener chaos expansion.
arXiv Detail & Related papers (2024-02-05T14:12:35Z)
- Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds [75.51968172401394]
We study the sample complexity of the neural policy mirror descent (NPMD) algorithm with deep convolutional neural networks (CNNs).
In each iteration of NPMD, both the value function and the policy can be well approximated by CNNs.
We show that NPMD can leverage the low-dimensional structure of state space to escape from the curse of dimensionality.
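For orientation, the tabular policy mirror descent step that NPMD approximates with CNNs can be sketched as follows (hedged illustration with hypothetical names, not the paper's algorithm):
```python
import numpy as np

def mirror_descent_policy_update(policy, q_values, eta=0.5):
    """KL-regularized mirror descent step over a (states x actions) table:
    pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q_k(s, a))."""
    logits = np.log(policy + 1e-12) + eta * q_values
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)
```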
arXiv Detail & Related papers (2023-09-25T07:31:22Z)
- Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., achieving near-zero training loss with the increase in the number of learning epochs.
We show that the threshold on the number of training samples increases with the increase in the network width.
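An unfolded ISTA "network" is just the ISTA iteration written as stacked layers; a minimal sketch using the standard (non-smooth) soft-threshold, whereas the paper analyzes a smoothed variant:
```python
import numpy as np

def soft_threshold(x, theta):
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def unfolded_ista(y, W, layers=10, theta=0.1, step=None):
    """Each 'layer' is one ISTA iteration for
    min_x 0.5*||y - Wx||^2 + theta*||x||_1."""
    if step is None:
        step = 1.0 / np.linalg.norm(W, ord=2) ** 2  # 1/L for the smooth part
    x = np.zeros(W.shape[1])
    for _ in range(layers):
        x = soft_threshold(x + step * W.T @ (y - W @ x), step * theta)
    return x
```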
arXiv Detail & Related papers (2023-09-12T13:03:47Z)
- Scalable Lipschitz Residual Networks with Convex Potential Flows [120.27516256281359]
We show that using convex potentials in a residual network gradient flow provides a built-in $1$-Lipschitz transformation.
A comprehensive set of experiments on CIFAR-10 demonstrates the scalability of our architecture and the benefit of our approach for $\ell_2$ provable defenses.
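A sketch of the convex-potential residual layer this line of work reports, 1-Lipschitz by construction (hedged; normalization details may differ from the paper's exact layer):
```python
import numpy as np

def convex_potential_layer(x, W, b):
    """Residual step of a convex potential gradient flow:
    z = x - (2 / ||W||_2^2) * W^T relu(Wx + b), which is 1-Lipschitz."""
    pre = np.maximum(W @ x + b, 0.0)
    return x - (2.0 / np.linalg.norm(W, ord=2) ** 2) * (W.T @ pre)
```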
arXiv Detail & Related papers (2021-10-25T07:12:53Z)
- FactorizeNet: Progressive Depth Factorization for Efficient Network Architecture Exploration Under Quantization Constraints [93.4221402881609]
We introduce a progressive depth factorization strategy for efficient CNN architecture exploration under quantization constraints.
By algorithmically increasing the granularity of depth factorization in a progressive manner, the proposed strategy enables a fine-grained, low-level analysis of layer-wise distributions.
Such a progressive depth factorization strategy also enables efficient identification of the optimal depth-factorized macroarchitecture design.
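At the channel level, depth factorization can be illustrated as grouped mixing whose granularity g is the knob a progressive strategy would sweep from coarse (g=1) to fine (g=C); a hypothetical sketch, not the FactorizeNet code:
```python
import numpy as np

def grouped_channel_mix(x, group_weights):
    """Split the C channels into g equal groups mixed independently;
    increasing g factorizes the dense mixing into finer blocks."""
    groups = np.split(x, len(group_weights))  # requires C divisible by g
    return np.concatenate([W @ blk for W, blk in zip(group_weights, groups)])

# progressive sweep doubles the granularity at each step: g = 1, 2, 4, ...
```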
arXiv Detail & Related papers (2020-11-30T07:12:26Z)