Understanding the Generalization of Bilevel Programming in Hyperparameter Optimization: A Tale of Bias-Variance Decomposition
- URL: http://arxiv.org/abs/2602.17947v1
- Date: Fri, 20 Feb 2026 02:52:13 GMT
- Title: Understanding the Generalization of Bilevel Programming in Hyperparameter Optimization: A Tale of Bias-Variance Decomposition
- Authors: Yubo Zhou, Jun Shu, Junmin Liu, Deyu Meng
- Abstract summary: We propose an ensemble hypergradient strategy to reduce the variance in HPO algorithms effectively. We establish a connection between excess error and hypergradient estimation, offering some understanding of empirical observations.
- Score: 53.68517860700599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient-based hyperparameter optimization (HPO) methods have emerged recently, leveraging bilevel programming techniques to optimize hyperparameters by estimating the hypergradient of the validation loss. Nevertheless, previous theoretical works mainly focus on reducing the gap between the estimate and the ground truth (i.e., the bias), while ignoring the error due to the data distribution (i.e., the variance), which degrades performance. To address this issue, we conduct a bias-variance decomposition of the hypergradient estimation error and provide a detailed analysis of the variance term ignored by previous works. We also present a comprehensive analysis of the error bounds for hypergradient estimation. This facilitates an easy explanation of some phenomena commonly observed in practice, such as overfitting to the validation set. Inspired by the derived theory, we propose an ensemble hypergradient strategy that effectively reduces the variance in HPO algorithms. Experimental results on tasks including regularization hyperparameter learning, data hyper-cleaning, and few-shot learning demonstrate that our variance reduction strategy improves hypergradient estimation. To explain the improved performance, we establish a connection between the excess error and hypergradient estimation, offering some understanding of the empirical observations.
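For orientation, the bias-variance decomposition referred to in the abstract follows the standard mean-squared-error decomposition sketched below. The notation (a hypergradient estimate $\hat{g}$ computed from sampled validation data and the ground-truth hypergradient $g^\star$) is illustrative and not taken from the paper.

```latex
% Standard bias-variance decomposition of the hypergradient estimation error.
% \hat{g}: hypergradient estimated from a sampled validation set (illustrative symbol);
% g^\star: ground-truth hypergradient of the population validation loss.
\mathbb{E}\!\left[\lVert \hat{g} - g^\star \rVert^2\right]
  = \underbrace{\lVert \mathbb{E}[\hat{g}] - g^\star \rVert^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\lVert \hat{g} - \mathbb{E}[\hat{g}] \rVert^2\right]}_{\text{variance}}
```

The abstract describes the ensemble hypergradient strategy only at a high level. A minimal sketch of one way such a strategy could be realized, namely averaging hypergradient estimates computed on resampled validation data so that the variance term shrinks while the bias term is unchanged, is given below; the `estimate_hypergrad` interface is a hypothetical placeholder, not the paper's API.

```python
import numpy as np

def ensemble_hypergradient(estimate_hypergrad, val_data, n_members=5, seed=0):
    """Average hypergradient estimates over bootstrap resamples of the validation data.

    estimate_hypergrad: callable taking a list of validation examples and returning a
                        hypergradient vector (hypothetical interface, for illustration only).
    val_data:           list of validation examples.
    n_members:          number of ensemble members to average.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_members):
        # Bootstrap-resample the validation data to obtain a roughly independent estimate.
        idx = rng.integers(0, len(val_data), size=len(val_data))
        estimates.append(np.asarray(estimate_hypergrad([val_data[i] for i in idx])))
    # Averaging keeps the bias of a single estimate but reduces its variance.
    return np.mean(np.stack(estimates), axis=0)
```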
Related papers
- Overtuning in Hyperparameter Optimization [11.91482877988017]
We provide a formal definition of overtuning and distinguish it from related concepts such as meta-overfitting. We conduct a large-scale reanalysis of HPO benchmark data to assess the prevalence and severity of overtuning. Our results show that overtuning is more common than previously assumed, typically mild but occasionally severe.
arXiv Detail & Related papers (2025-06-24T11:49:48Z) - Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning [53.25336975467293]
We present the first theoretical error decomposition analysis of methods such as perplexity and self-consistency. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function. We propose Reasoning-Pruning Perplexity Consistency (RPC), which integrates perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths.
arXiv Detail & Related papers (2025-02-01T18:09:49Z) - Revisiting Essential and Nonessential Settings of Evidential Deep Learning [70.82728812001807]
Evidential Deep Learning (EDL) is an emerging method for uncertainty estimation.
We propose Re-EDL, a simplified yet more effective variant of EDL.
arXiv Detail & Related papers (2024-10-01T04:27:07Z) - Error Bounds of Supervised Classification from Information-Theoretic Perspective [0.0]
We explore bounds on the expected risk when using deep neural networks for supervised classification from an information theoretic perspective.
We introduce model risk and fitting error, which are derived from further decomposing the empirical risk.
arXiv Detail & Related papers (2024-06-07T01:07:35Z) - Enhancing Hypergradients Estimation: A Study of Preconditioning and
Reparameterization [49.73341101297818]
Bilevel optimization aims to optimize an outer objective function that depends on the solution to an inner optimization problem.
The conventional method to compute the so-called hypergradient of the outer problem is to use the Implicit Function Theorem (IFT); a sketch of this formula appears after this list.
We study the error of the IFT method and analyze two strategies to reduce this error.
arXiv Detail & Related papers (2024-02-26T17:09:18Z) - Model-Based Reparameterization Policy Gradient Methods: Theory and
Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
arXiv Detail & Related papers (2023-10-30T18:43:21Z) - Exploring the Optimized Value of Each Hyperparameter in Various Gradient
Descent Algorithms [0.0]
Gradient descent algorithms have been applied to the parameter optimization of several deep learning models, achieving higher accuracies or lower errors.
This study proposes an analytical framework for analyzing the mean error of each objective function based on various gradient descent algorithms.
The experimental results show that the proposed method achieves more efficient convergence and lower errors.
arXiv Detail & Related papers (2022-12-23T12:04:33Z) - Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding [58.73333095047114]
We propose an error-based thresholding (EBT) mechanism for learned ISTA (LISTA).
We show that the proposed EBT mechanism well disentangles the learnable parameters in the shrinkage functions from the reconstruction errors.
arXiv Detail & Related papers (2021-12-21T05:07:54Z) - Stability and Generalization of Bilevel Programming in Hyperparameter
Optimization [38.93716746097571]
We show, from a theoretical perspective, that gradient-based algorithms can outperform cross-validation under certain conditions.
We prove that regularization terms in both the outer and inner levels can relieve the overfitting problem in gradient-based algorithms.
arXiv Detail & Related papers (2021-06-08T08:59:55Z) - Convergence Properties of Stochastic Hypergradients [34.81849268839475]
We study approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk minimization on a large dataset. We provide numerical experiments to support our theoretical analysis and to show the advantage of using hypergradients in practice.
arXiv Detail & Related papers (2020-11-13T20:50:36Z) - Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better generalization bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
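For the "Enhancing Hypergradients Estimation" entry above, the IFT-based hypergradient is the standard implicit-differentiation formula sketched below. This is generic bilevel-HPO notation (inner parameters $\theta$, hyperparameters $\lambda$), not a formula quoted from that paper, and it assumes the inner Hessian is invertible.

```latex
% IFT-based hypergradient for bilevel HPO (generic sketch).
% Inner problem:  \theta^\star(\lambda) = \arg\min_\theta L_{\mathrm{train}}(\theta, \lambda)
% Outer problem:  \min_\lambda L_{\mathrm{val}}(\theta^\star(\lambda), \lambda)
\frac{\mathrm{d} L_{\mathrm{val}}}{\mathrm{d}\lambda}
  = \nabla_\lambda L_{\mathrm{val}}
  - \nabla^2_{\lambda\theta} L_{\mathrm{train}}
    \left[ \nabla^2_{\theta\theta} L_{\mathrm{train}} \right]^{-1}
    \nabla_\theta L_{\mathrm{val}},
  \qquad \text{evaluated at } \theta = \theta^\star(\lambda).
```

In practice the inverse-Hessian-vector product is approximated (for example by conjugate gradient or a truncated Neumann series), which is one source of the estimation bias discussed in these papers.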