Related papers: Understanding the Curse of Unrolling

URL: http://arxiv.org/abs/2602.19733v1
Date: Mon, 23 Feb 2026 11:32:39 GMT
Title: Understanding the Curse of Unrolling
Authors: Sheheryar Mehmood, Florian Knoll, Peter Ochs
Abstract summary: We provide a non-asymptotic analysis that explains the origin of the curse of unrolling. We show that truncating early iterations of the derivative computation mitigates the curse while simultaneously reducing memory requirements. Our theoretical findings are supported by numerical experiments on representative examples.
Score: 2.2452525313322966
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Algorithm unrolling is ubiquitous in machine learning, particularly in hyperparameter optimization and meta-learning, where Jacobians of solution mappings are computed by differentiating through iterative algorithms. Although unrolling is known to yield asymptotically correct Jacobians under suitable conditions, recent work has shown that the derivative iterates may initially diverge from the true Jacobian, a phenomenon known as the curse of unrolling. In this work, we provide a non-asymptotic analysis that explains the origin of this behavior and identifies the algorithmic factors that govern it. We show that truncating early iterations of the derivative computation mitigates the curse while simultaneously reducing memory requirements. Finally, we demonstrate that warm-starting in bilevel optimization naturally induces an implicit form of truncation, providing a practical remedy. Our theoretical findings are supported by numerical experiments on representative examples.
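The behavior described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' method or experiments: it assumes a scalar inner problem f(x, theta) = 0.5 * theta * x^2 - x (minimizer 1/theta, true derivative -1/theta^2), solved by gradient descent with a fixed step size. The names `unrolled_derivative`, `true_derivative`, and the `skip` parameter are hypothetical, and "truncation" is interpreted here as switching on the derivative recursion only after the first `skip` iterations.

```python
def unrolled_derivative(theta, x0, alpha, num_steps, skip=0):
    """Unroll gradient descent on f(x, theta) = 0.5*theta*x**2 - x.

    Propagates jac = dx_k/dtheta in forward mode next to the iterate x_k.
    The derivative recursion is active only for iterations k >= skip, so
    the iterate reached after `skip` steps is treated as a constant
    (an assumed, illustrative form of truncation).
    Returns the pair (x_K, jac_K) after num_steps iterations.
    """
    x, jac = x0, 0.0
    for k in range(num_steps):
        if k >= skip:
            # Differentiate x_{k+1} = x_k - alpha*(theta*x_k - 1) w.r.t. theta,
            # using the iterate x_k from before the update.
            jac = (1.0 - alpha * theta) * jac - alpha * x
        x = x - alpha * (theta * x - 1.0)
    return x, jac


def true_derivative(theta):
    """Derivative of the exact minimizer x*(theta) = 1/theta."""
    return -1.0 / theta**2


if __name__ == "__main__":
    theta, x0, alpha = 1.0, 10.0, 0.1  # illustrative values, not from the paper
    target = true_derivative(theta)
    for steps in (0, 5, 10, 20, 40, 80, 160):
        _, jac = unrolled_derivative(theta, x0, alpha, steps)
        print(f"steps={steps:4d}  full unrolling error={abs(jac - target):.4f}")
    _, jac_trunc = unrolled_derivative(theta, x0, alpha, 40, skip=30)
    print(f"steps=  40  truncated (skip=30) error={abs(jac_trunc - target):.4f}")
```

With these settings the error of the fully unrolled derivative first grows before it decays, while the truncated variant starts its derivative recursion from an iterate that is already close to the minimizer. The sketch uses forward-mode propagation for brevity, so it does not show the memory savings; those arise in reverse mode, where skipped iterations need not be stored.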

Related papers

Bregman-divergence-based Arimoto-Blahut algorithm [53.64687146666141]
We generalize the Arimoto-Blahut algorithm to a general function defined over Bregman-divergence system. This algorithm can be applied to classical and quantum rate-distortion theory.
arXiv Detail & Related papers (2024-08-10T06:16:24Z)
Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics (ARC) methods have proven to have some very appealing theoretical properties. We show that TR and ARC methods can simultaneously tolerate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
Neural incomplete factorization: learning preconditioners for the conjugate gradient method [2.899792823251184]
We develop a data-driven approach to accelerate the generation of effective preconditioners. We replace the typically hand-engineered preconditioners by the output of graph neural networks. Our method generates an incomplete factorization of the matrix and is, therefore, referred to as neural incomplete factorization (NeuralIF).
arXiv Detail & Related papers (2023-05-25T11:45:46Z)
One-step differentiation of iterative algorithms [7.9495796547433395]
We study one-step differentiation, also known as Jacobian-free backpropagation, a method as easy as automatic differentiation. We provide a complete theoretical approximation analysis with specific examples along with its consequences in bilevel optimization.
arXiv Detail & Related papers (2023-05-23T07:32:37Z)
Isotropic Gaussian Processes on Finite Spaces of Graphs [71.26737403006778]
We propose a principled way to define Gaussian process priors on various sets of unweighted graphs. We go further to consider sets of equivalence classes of unweighted graphs and define the appropriate versions of priors thereon. Inspired by applications in chemistry, we illustrate the proposed techniques on a real molecular property prediction task in the small data regime.
arXiv Detail & Related papers (2022-11-03T10:18:17Z)
The Curse of Unrolling: Rate of Differentiating Through Optimization [35.327233435055305]
Unrolled differentiation approximates the solution using an iterative solver and differentiates it through the computational path. We show that we can either 1) choose a large learning rate leading to a fast convergence but accept that the algorithm may have an arbitrarily long burn-in phase or 2) choose a smaller learning rate leading to an immediate but slower convergence.
arXiv Detail & Related papers (2022-09-27T09:27:29Z)
On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefit of large learning rates can be precisely characterized in the context of kernel methods. We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
Amortized Implicit Differentiation for Stochastic Bilevel Optimization [53.12363770169761]
We study a class of algorithms for solving bilevel optimization problems in both deterministic and stochastic settings. We exploit a warm-start strategy to amortize the estimation of the exact gradient. By using this framework, our analysis shows these algorithms to match the computational complexity of methods that have access to an unbiased estimate of the gradient.
arXiv Detail & Related papers (2021-11-29T15:10:09Z)
Generalization Properties of Stochastic Optimizers via Trajectory Analysis [48.38493838310503]
We show that both the Fernique-Talagrand functional and the local power law are predictive of generalization performance.
arXiv Detail & Related papers (2021-08-02T10:58:32Z)
Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded in terms of the 'complexity' of the fractal structure that underlies its generalization measure. We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z)
Learning to solve TV regularized problems with unrolled algorithms [18.241062505073234]
Total Variation (TV) is a popular regularization strategy that promotes piece-wise constant signals. We develop and characterize two approaches to do so, describe their benefits and limitations, and discuss the regime where they can actually improve over iterative procedures.
arXiv Detail & Related papers (2020-10-19T14:19:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.