Debiasing Mini-Batch Quadratics for Applications in Deep Learning
- URL: http://arxiv.org/abs/2410.14325v1
- Date: Fri, 18 Oct 2024 09:37:05 GMT
- Title: Debiasing Mini-Batch Quadratics for Applications in Deep Learning
- Authors: Lukas Tatzel, Bálint Mucsányi, Osane Hackel, Philipp Hennig
- Abstract summary: Quadratic approximations form a fundamental building block of machine learning methods.
When computations on the entire training set are intractable - typical for deep learning - the relevant quantities are computed on mini-batches.
In this paper, we (i) show that this bias introduces a systematic error, (ii) provide a theoretical explanation for it, (iii) explain its relevance for second-order optimization and uncertainty quantification via the Laplace approximation in deep learning, and (iv) develop and evaluate debiasing strategies.
- Score: 22.90473935350847
- Abstract: Quadratic approximations form a fundamental building block of machine learning methods. E.g., second-order optimizers try to find the Newton step into the minimum of a local quadratic proxy to the objective function; and the second-order approximation of a network's loss function can be used to quantify the uncertainty of its outputs via the Laplace approximation. When computations on the entire training set are intractable - typical for deep learning - the relevant quantities are computed on mini-batches. This, however, distorts and biases the shape of the associated stochastic quadratic approximations in an intricate way with detrimental effects on applications. In this paper, we (i) show that this bias introduces a systematic error, (ii) provide a theoretical explanation for it, (iii) explain its relevance for second-order optimization and uncertainty quantification via the Laplace approximation in deep learning, and (iv) develop and evaluate debiasing strategies.
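The bias can be seen even in a toy setting. Below is a minimal sketch (not taken from the paper; the least-squares setup and all names are illustrative assumptions) showing that mini-batch Hessians average exactly to the full-batch Hessian, while their inverses, which enter Newton steps and Laplace covariances, are systematically inflated in the positive-semidefinite order by Jensen's inequality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy curvature model: the full-data Hessian is the average of per-example
# Hessians, here H = X^T X / n + eps * I for anisotropic features X.
n, d, batch = 4096, 5, 32
X = rng.normal(size=(n, d)) * rng.uniform(0.5, 3.0, size=d)
H_full = X.T @ X / n + 1e-2 * np.eye(d)

# Mini-batch Hessians over a partition of the data average back to H_full exactly
# (with random subsampling they would be unbiased estimates of it) ...
H_batches = [X[idx].T @ X[idx] / batch + 1e-2 * np.eye(d)
             for idx in np.split(rng.permutation(n), n // batch)]
print("E[H_B] == H:", np.allclose(np.mean(H_batches, axis=0), H_full))

# ... but their inverses are not unbiased: by Jensen's inequality
# E[H_B^{-1}] >= H^{-1} in the PSD order, so quantities built from mini-batch
# inverse curvature (Newton steps, Laplace covariances) are distorted on average.
inv_mb = np.mean([np.linalg.inv(Hb) for Hb in H_batches], axis=0)
inv_full = np.linalg.inv(H_full)
print("trace of mean mini-batch inverse:", np.trace(inv_mb))
print("trace of full-batch inverse     :", np.trace(inv_full))
```

The trace of the averaged mini-batch inverse exceeds that of the full-batch inverse, so the quadratic implied by mini-batch curvature differs systematically from the full-batch one; the paper analyzes this kind of distortion and how to correct it.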
Related papers
- Off-policy estimation with adaptively collected data: the power of online learning [20.023469636707635]
We consider estimation of a linear functional of the treatment effect using adaptively collected data.
We propose a general reduction scheme that allows one to produce a sequence of estimates for the treatment effect via online learning.
arXiv Detail & Related papers (2024-11-19T10:18:27Z) - Trust-Region Sequential Quadratic Programming for Stochastic Optimization with Random Models [57.52124921268249]
We propose a Trust-Region Sequential Quadratic Programming method to find both first- and second-order stationary points.
To converge to first-order stationary points, our method computes a gradient step in each iteration, defined by minimizing a quadratic approximation of the objective subject to a trust-region constraint.
To converge to second-order stationary points, our method additionally computes an eigen step to explore the negative curvature of the reduced Hessian matrix.
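As a rough illustration of these two ingredients (a simplified, unconstrained sketch with illustrative function names, not the authors' algorithm; the plain Hessian stands in for the reduced Hessian of the constrained setting), the gradient step can be taken as the Cauchy point of a quadratic model inside a trust region, and the eigen step as a move along the most negative-curvature eigenvector:

```python
import numpy as np

def trust_region_gradient_step(g, H, radius):
    """Cauchy point of the model m(p) = g^T p + 0.5 p^T H p over ||p|| <= radius."""
    gHg = g @ H @ g
    if gHg <= 0:
        tau = 1.0  # model is non-convex along -g: go to the trust-region boundary
    else:
        tau = min(1.0, np.linalg.norm(g) ** 3 / (radius * gHg))
    return -tau * radius * g / np.linalg.norm(g)

def eigen_step(H, radius, tol=-1e-8):
    """Step along the eigenvector of the most negative eigenvalue of H, if any.
    (In practice the sign is chosen so that the step decreases the model.)"""
    lam, V = np.linalg.eigh(H)
    if lam[0] < tol:
        return radius * V[:, 0]
    return np.zeros(H.shape[0])

# Illustration on a saddle: f(x) = 0.5 * (x0^2 - x1^2), evaluated at x = (1, 0).
g = np.array([1.0, 0.0])          # gradient
H = np.diag([1.0, -1.0])          # indefinite Hessian
print(trust_region_gradient_step(g, H, radius=0.5))   # descends along -g
print(eigen_step(H, radius=0.5))                       # explores negative curvature
```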
arXiv Detail & Related papers (2024-09-24T04:39:47Z) - DF2: Distribution-Free Decision-Focused Learning [53.2476224456902]
Decision-focused learning (DFL) has recently emerged as a powerful approach for predict-then-optimize problems.
Existing end-to-end DFL methods are hindered by three significant bottlenecks: model error, sample average approximation error, and distribution-based parameterization of the expected objective.
We present DF2 -- the first distribution-free decision-focused learning method explicitly designed to address these three bottlenecks.
arXiv Detail & Related papers (2023-08-11T00:44:46Z) - Efficient Model-Free Exploration in Low-Rank MDPs [76.87340323826945]
Low-Rank Markov Decision Processes offer a simple, yet expressive framework for RL with function approximation.
Existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions.
We propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs.
arXiv Detail & Related papers (2023-07-08T15:41:48Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefits of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
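A small illustration of this effect (my own toy example on a finite-dimensional diagonal quadratic, standing in for the separable Hilbert-space setting): with a fixed early-stopping budget, the learning rate determines which eigendirections of the quadratic have already converged.

```python
import numpy as np

# Quadratic objective f(w) = 0.5 * sum_i lam_i * (w_i - w*_i)^2 with a spread spectrum.
lam = np.array([10.0, 1.0, 0.1, 0.01])   # eigenvalues of the (diagonal) Hessian
w_star = np.ones_like(lam)

def run_gd(lr, steps):
    """Gradient descent from 0; coordinate i after t steps is (1 - (1 - lr*lam_i)^t) * w*_i."""
    w = np.zeros_like(lam)
    for _ in range(steps):
        w -= lr * lam * (w - w_star)
    return w

# Same early-stopping budget, different (stable) learning rates: the larger rate has
# already fit the small-eigenvalue directions, the smaller one only the large ones.
for lr in (0.02, 0.19):
    print(f"lr={lr:4.2f} ->", np.round(run_gd(lr, steps=50), 3))
```

The printed iterates show the spectral filtering: the solution obtained under early stopping depends on the learning rate through how far each eigendirection has progressed.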
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - Nonlinear Level Set Learning for Function Approximation on Sparse Data with Applications to Parametric Differential Equations [6.184270985214254]
"Nonlinear Level set Learning" (NLL) approach is presented for the pointwise prediction of functions which have been sparsely sampled.
The proposed algorithm effectively reduces the input dimension to the theoretical lower bound with minor accuracy loss.
Experiments and applications are presented which compare this modified NLL with the original NLL and the Active Subspaces (AS) method.
arXiv Detail & Related papers (2021-04-29T01:54:05Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex training problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - Distributed Averaging Methods for Randomized Second Order Optimization [54.51566432934556]
We consider distributed optimization problems where forming the Hessian is computationally challenging and communication is a bottleneck.
We develop unbiased parameter averaging methods for randomized second order optimization that employ sampling and sketching of the Hessian.
We also extend the framework of second order averaging methods to introduce an unbiased distributed optimization framework for heterogeneous computing systems.
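The sketch below illustrates the baseline this line of work improves on, under assumptions of my own (a ridge-regression objective, row subsampling as the Hessian sketch, illustrative names such as worker_step): each worker solves a Newton system with a sketched Hessian and the full gradient, and the results are naively averaged. The listed paper develops reweighting so that such averages become unbiased; that correction is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Regularized least squares: minimize 0.5*||A x - b||^2 + 0.5*lam*||x||^2.
n, d, lam = 8192, 20, 1e-1
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x_exact = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def worker_step(sketch_rows):
    """One worker's Newton-type estimate from a row-subsampled (sketched) Hessian."""
    idx = rng.choice(n, size=sketch_rows, replace=False)
    S_A = A[idx] * np.sqrt(n / sketch_rows)        # rescaled row sketch of A
    H_sketch = S_A.T @ S_A + lam * np.eye(d)       # sketched Hessian
    return np.linalg.solve(H_sketch, A.T @ b)      # full gradient, sketched curvature

# Naive distributed averaging of the workers' estimates: variance shrinks, but a
# bias from the sketched inverse Hessian remains, motivating debiased averaging.
workers = [worker_step(sketch_rows=100) for _ in range(50)]
x_avg = np.mean(workers, axis=0)
print("relative error of one worker  :",
      np.linalg.norm(workers[0] - x_exact) / np.linalg.norm(x_exact))
print("relative error after averaging:",
      np.linalg.norm(x_avg - x_exact) / np.linalg.norm(x_exact))
```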
arXiv Detail & Related papers (2020-02-16T09:01:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.