Sparse Training of Neural Networks based on Multilevel Mirror Descent
- URL: http://arxiv.org/abs/2602.03535v1
- Date: Tue, 03 Feb 2026 13:51:45 GMT
- Title: Sparse Training of Neural Networks based on Multilevel Mirror Descent
- Authors: Yannick Lunk, Sebastian J. Scott, Leon Bungert
- Abstract summary: We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent. We empirically show that our algorithm can produce highly sparse and accurate models on standard benchmarks.
- Score: 0.688204255655161
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits the naturally incurred sparsity by alternating between periods of static and dynamic sparsity pattern updates. The key idea is to combine sparsity-inducing Bregman iterations with adaptive freezing of the network structure to enable efficient exploration of the sparse parameter space while maintaining sparsity. We provide convergence guarantees by embedding our method in a multilevel optimization framework. Furthermore, we empirically show that our algorithm can produce highly sparse and accurate models on standard benchmarks. We also show that, relative to SGD training, the theoretical FLOP count can be reduced from 38% for standard Bregman iterations to 6% for our method while maintaining test accuracy.
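As a rough illustration of the mechanics described in the abstract, the sketch below combines a linearized Bregman / mirror-descent update (a dual variable accumulates gradients; the primal weights are obtained by soft-thresholding) with alternating phases in which the sparsity pattern is frozen or released. It is only a toy example on a sparse regression problem: the function names, the shrinkage-based mirror map, the phase length, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of sparse training via linearized Bregman iterations / mirror descent
# with alternating static (frozen pattern) and dynamic (free pattern) phases.
# Illustrative only; hyperparameters and the exact update rule are assumptions.
import numpy as np

def soft_threshold(v, lam):
    """Proximal map of lam * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def bregman_sparse_train(grad, w0, steps=200, tau=0.1, delta=1.0, lam=0.5, phase_len=50):
    """Linearized Bregman loop that alternates between dynamic phases
    (the sparsity pattern may change) and static phases (the pattern is frozen)."""
    v = np.zeros_like(w0)                        # dual / subgradient variable
    w = delta * soft_threshold(v, lam)           # primal (sparse) weights
    active = np.ones(w0.shape, dtype=bool)       # coordinates allowed to change
    for k in range(steps):
        v -= tau * grad(w) * active              # mirror-descent step on active coords
        w = delta * soft_threshold(v, lam)       # sparsifying mirror map
        if (k + 1) % (2 * phase_len) == phase_len:
            active = (w != 0.0)                  # static phase: freeze zero weights
        elif (k + 1) % (2 * phase_len) == 0:
            active = np.ones(w0.shape, dtype=bool)  # dynamic phase: release pattern
    return w

if __name__ == "__main__":
    # toy sparse regression: recover a 5-sparse vector from noisy linear measurements
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 50))
    x_true = np.zeros(50)
    x_true[:5] = rng.standard_normal(5)
    y = A @ x_true + 0.01 * rng.standard_normal(100)
    grad = lambda w: A.T @ (A @ w - y) / len(y)
    w_hat = bregman_sparse_train(grad, np.zeros(50))
    print("nonzeros:", np.count_nonzero(w_hat), "of", w_hat.size)
```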
Related papers
- ODELoRA: Training Low-Rank Adaptation by Solving Ordinary Differential Equations [54.886931928255564]
Low-rank adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning method in deep transfer learning. We propose a novel continuous-time optimization dynamic for LoRA factor matrices in the form of an ordinary differential equation (ODE). We show that ODELoRA achieves stable feature learning, a property that is crucial for training deep neural networks at different scales of problem dimensionality.
arXiv Detail & Related papers (2026-02-07T10:19:36Z) - Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z) - Hyperpruning: Efficient Search through Pruned Variants of Recurrent Neural Networks Leveraging Lyapunov Spectrum [7.136205674624814]
We introduce an efficient hyperpruning framework, termed LS-based Hyperpruning (LSH). LSH reduces search time by an order of magnitude compared to conventional approaches that rely on full training.
arXiv Detail & Related papers (2025-06-09T17:49:29Z) - Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z) - Pushing the Limits of Sparsity: A Bag of Tricks for Extreme Pruning [32.918269107547616]
Pruning of deep neural networks has been an effective technique for reducing model size while preserving most of the performance of dense networks. Recent sparse learning methods have shown promising performance up to moderate sparsity levels such as 95% and 98%. We propose a collection of techniques that enable the continuous learning of networks without accuracy collapse even at extreme sparsities.
arXiv Detail & Related papers (2024-11-20T18:54:53Z) - Complexity-Aware Training of Deep Neural Networks for Optimal Structure Discovery [0.0]
We propose a novel algorithm for combined unit and layer pruning of deep neural networks that operates during training, without requiring a pre-trained network. Our algorithm optimally trades off learning accuracy and pruning levels while balancing layer vs. unit pruning and computational vs. parameter complexity. We show that the proposed algorithm converges to solutions of the corresponding network optimization problem.
arXiv Detail & Related papers (2024-11-14T02:00:22Z) - Regularized Adaptive Momentum Dual Averaging with an Efficient Inexact Subproblem Solver for Training Structured Neural Network [9.48424754175943]
We propose Regularized Adaptive Momentum Dual Averaging (RAMDA), an algorithm for training structured neural networks. We show that RAMDA attains the ideal structure induced by the regularizer at the stationary point of convergence. This structure is locally optimal near the point of convergence, so RAMDA is guaranteed to obtain the best structure possible.
arXiv Detail & Related papers (2024-03-21T13:43:49Z) - Training Bayesian Neural Networks with Sparse Subspace Variational Inference [35.241207717307645]
Sparse Subspace Variational Inference (SSVI) is the first fully sparse BNN framework that maintains a consistently highly sparse model throughout the training and inference phases.
Our experiments show that SSVI sets new benchmarks in crafting sparse BNNs, achieving, for instance, a 10-20x compression in model size with under 3% performance drop, and up to 20x FLOPs reduction during training compared with dense VI training.
arXiv Detail & Related papers (2024-02-16T19:15:49Z) - Accurate Neural Network Pruning Requires Rethinking Sparse Optimization [87.90654868505518]
We show the impact of high sparsity on model training using the standard computer vision and natural language processing sparsity benchmarks.
We provide new approaches for mitigating this issue for both sparse pre-training of vision models and sparse fine-tuning of language models.
arXiv Detail & Related papers (2023-08-03T21:49:14Z) - Stochastic Unrolled Federated Learning [85.6993263983062]
We introduce UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning.
Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers and the decentralized nature of federated learning.
arXiv Detail & Related papers (2023-05-24T17:26:22Z) - Dynamic Sparse Training with Structured Sparsity [11.778353786208765]
Dynamic Sparse Training (DST) methods achieve state-of-the-art results in sparse neural network training.
We propose a sparse-to-sparse DST method, Structured RigL (SRigL), to learn a variant of fine-grained structured N:M sparsity.
We demonstrate a real-world acceleration of 3.4x/2.5x on CPU for online inference and 1.7x/13.0x on GPU for inference with a batch size of 256.
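For context on the fine-grained N:M pattern that SRigL targets, the following generic sketch applies an N:M (e.g., 2:4) sparsity mask to a weight matrix. The helper name `nm_prune` and the magnitude-based selection are illustrative assumptions and do not reproduce SRigL's mask-update rule.

```python
# Generic N:M structured sparsity mask: keep n of every m consecutive weights.
import numpy as np

def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude entries in every group of m consecutive
    weights along the last axis and zero out the rest."""
    w = weights.reshape(-1, m)
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]   # smallest-magnitude indices
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (w * mask).reshape(weights.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 8))         # last dimension must be divisible by m
    print(nm_prune(W, n=2, m=4))            # exactly 2 nonzeros per group of 4
```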
arXiv Detail & Related papers (2023-05-03T17:48:55Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
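To make the reweighting idea concrete, here is a minimal sketch of a DRO-style min-max loop with a parametric (linear-in-features) likelihood ratio on a toy logistic-regression problem. The adversary parameterization, the step sizes, and the absence of an explicit divergence constraint are simplifying assumptions for illustration, not the paper's method.

```python
# Toy DRO with a parametric likelihood-ratio adversary: the adversary learns
# example weights (normalized to mean one) that emphasize high-loss examples,
# and the model is trained on the reweighted loss. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
# majority group (easy) and minority group (near the decision boundary)
X = np.vstack([rng.normal(+1.0, 0.5, (90, 2)), rng.normal(-0.3, 0.5, (10, 2))])
y = np.concatenate([np.ones(90), -np.ones(10)])

theta = np.zeros(2)   # model parameters (linear classifier)
psi = np.zeros(2)     # adversary parameters defining the likelihood ratio

for step in range(500):
    margins = y * (X @ theta)
    losses = np.log1p(np.exp(-margins))          # per-example logistic loss
    # parametric likelihood ratio via softmax weights p_i over the batch
    s = X @ psi
    p = np.exp(s - s.max()); p /= p.sum()
    # adversary ascent: shift weight toward high-loss examples
    weighted = p @ losses
    psi += 0.5 * (X.T @ (p * (losses - weighted)))
    # model descent on the reweighted loss, treating the ratio as fixed
    sig = 1.0 / (1.0 + np.exp(margins))          # |d loss / d margin|
    grad_theta = -(X * (p * sig * y)[:, None]).sum(axis=0)
    theta -= 0.1 * grad_theta

print("classifier weights:", theta)
```

A real implementation would additionally constrain or regularize the likelihood ratio (e.g., with a divergence penalty) so the adversary cannot concentrate all weight on a few examples.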
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - A Bregman Learning Framework for Sparse Neural Networks [1.7778609937758323]
We propose a learning framework based on Bregman iterations to train sparse neural networks.
We derive a baseline algorithm called LinBreg, an accelerated version using momentum, and AdaBreg, which is a Bregmanized generalization of the Adam algorithm.
arXiv Detail & Related papers (2021-05-10T12:56:01Z) - Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, which shows better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.