Related papers: HAM: A Hyperbolic Step to Regulate Implicit Bias

HAM: A Hyperbolic Step to Regulate Implicit Bias

URL: http://arxiv.org/abs/2506.02630v1
Date: Tue, 03 Jun 2025 08:47:16 GMT
Title: HAM: A Hyperbolic Step to Regulate Implicit Bias
Authors: Tom Jacobs, Advait Gadhikar, Celia Rubio-Madrigal, Rebekka Burkholz,
Abstract summary: We show that HAM (Hyperbolic Minimization) alternates between an overhead step and a new hyperbolic mirror step.<n>Ham's implicit bias consistently boosts performance--even of dense training.<n>Ham is especially effective in combination with different sparsification methods, improving upon the state of the art.
Score: 14.701241300621648
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding the implicit bias of optimization algorithms has become central to explaining the generalization behavior of deep learning models. For instance, the hyperbolic implicit bias induced by the overparameterization $m \odot w$--though effective in promoting sparsity--can result in a small effective learning rate, which slows down convergence. To overcome this obstacle, we propose HAM (Hyperbolic Aware Minimization), which alternates between an optimizer step and a new hyperbolic mirror step. We derive the Riemannian gradient flow for its combination with gradient descent, leading to improved convergence and a similar beneficial hyperbolic geometry as $m \odot w$ for feature learning. We provide an interpretation of the the algorithm by relating it to natural gradient descent, and an exact characterization of its implicit bias for underdetermined linear regression. HAM's implicit bias consistently boosts performance--even of dense training, as we demonstrate in experiments across diverse tasks, including vision, graph and node classification, and large language model fine-tuning. HAM is especially effective in combination with different sparsification methods, improving upon the state of the art. The hyperbolic step requires minimal computational and memory overhead, it succeeds even with small batch sizes, and its implementation integrates smoothly with existing optimizers.

Related papers

Randomised Splitting Methods and Stochastic Gradient Descent [0.0]
We introduce a new minibatching strategy (called Symmetric Minibatching Strategy) for gradient optimisation.<n>We provide improved convergence guarantees for this new minibatching strategy using Lynov techniques.<n>We argue that this also leads to a faster convergence rate when considering a decreasing stepsize schedule.
arXiv Detail & Related papers (2025-04-05T20:07:34Z)
Optimistic Gradient Learning with Hessian Corrections for High-Dimensional Black-Box Optimization [14.073853819633745]
Black-box algorithms are designed to optimize functions without relying on their underlying analytical structure or gradient information.<n>We propose two novel gradient learning variants to address the challenges posed by high-dimensional, complex, and highly non-linear problems.
arXiv Detail & Related papers (2025-02-07T11:03:50Z)
Linearly Convergent Mixup Learning [0.0]
We present two novel algorithms that extend to a broader range of binary classification models.<n>Unlike gradient-based approaches, our algorithms do not require hyper parameters like learning rates, simplifying their implementation and optimization.<n>Our algorithms achieve faster convergence to the optimal solution compared to descent gradient approaches, and that mixup data augmentation consistently improves the predictive performance across various loss functions.
arXiv Detail & Related papers (2025-01-14T02:33:40Z)
A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning. These problems are often formalized as Bi-Level optimizations (BLO) We introduce a novel perspective by turning a given BLO problem into a ii optimization, where the inner loss function becomes a smooth distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z)
Robust Hyperbolic Learning with Curvature-Aware Optimization [7.89323764547292]
Current hyperbolic learning approaches are prone to overfitting, computationally expensive, and prone to instability.<n>We introduce a novel fine-tunable hyperbolic scaling approach to constrain hyperbolic embeddings reduce approximation errors.<n>Our approach demonstrates consistent improvements across Computer Vision, EEG classification, and hierarchical metric learning tasks.
arXiv Detail & Related papers (2024-05-22T20:30:14Z)
ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-free) gradient based adaption. The main idea of the method is to adapt the $alpha by situational awareness. It can be applied to problems of any dimensions n and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
textscAdaMomentum on vision, and achieves state-the-art results consistently on other tasks including language processing. textscAdaMomentum on vision, and achieves state-the-art results consistently on other tasks including language processing. textscAdaMomentum on vision, and achieves state-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to emphNon-uniform Smoothness (NS) and emphNon-uniform Lojasiewicz inequality (NL) New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $Omega (1/t2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z)
Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information. We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z)
Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error. Rather than fixing the minibatch the step-size at the outset, we propose to allow parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem. We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent. Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.