A Unified Approach to Controlling Implicit Regularization via Mirror Descent
- URL: http://arxiv.org/abs/2306.13853v2
- Date: Thu, 11 Jan 2024 14:35:10 GMT
- Title: A Unified Approach to Controlling Implicit Regularization via Mirror Descent
- Authors: Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, Navid Azizan
- Abstract summary: Mirror descent (MD) is a notable generalization of gradient descent (GD).
We show that MD can be implemented efficiently and enjoys fast convergence under suitable conditions.
- Score: 18.536453909759544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by the remarkable success of large neural networks, there has been
significant interest in understanding the generalization performance of
over-parameterized models. Substantial efforts have been invested in
characterizing how optimization algorithms impact generalization through their
"preferred" solutions, a phenomenon commonly referred to as implicit
regularization. In particular, it has been argued that gradient descent (GD)
induces an implicit $\ell_2$-norm regularization in regression and
classification problems. However, the implicit regularization of different
algorithms is confined to either a specific geometry or a particular class of
learning problems, indicating the lack of a general approach for controlling
implicit regularization. To address this, we present a unified approach using
mirror descent (MD), a notable generalization of GD, to control implicit
regularization in both regression and classification settings. More
specifically, we show that MD with the general class of homogeneous potential
functions converges in direction to a generalized maximum-margin solution for
linear classification problems, thereby answering a long-standing question in
the classification setting. Further, we show that MD can be implemented
efficiently and enjoys fast convergence under suitable conditions. Through
comprehensive experiments, we demonstrate that MD is a versatile method to
produce learned models with different regularizers, which in turn exhibit
different generalization performance.
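
The mirror descent update at the heart of the abstract replaces the Euclidean geometry of GD with one induced by a potential $\psi$: the dual iterate follows $\nabla\psi(w_{t+1}) = \nabla\psi(w_t) - \eta \nabla L(w_t)$, and GD is recovered with $\psi(w) = \frac{1}{2}\|w\|_2^2$. Below is a minimal sketch of this idea (not the authors' implementation) using the homogeneous $p$-norm potential $\psi(w) = \frac{1}{p}\|w\|_p^p$ on an over-parameterized linear regression problem; the choice of potential, step size, iteration count, and problem sizes are illustrative assumptions and may need tuning.

```python
import numpy as np

def mirror_map(w, p):
    """Gradient of the potential psi(w) = (1/p) * ||w||_p^p."""
    return np.sign(w) * np.abs(w) ** (p - 1)

def inverse_mirror_map(z, p):
    """Inverse of the mirror map, i.e. (grad psi)^{-1}."""
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))

def mirror_descent(X, y, p=1.5, lr=1e-2, steps=50_000):
    """Mirror descent on the squared loss with a p-norm potential.

    Dual update: grad psi(w_{t+1}) = grad psi(w_t) - lr * grad L(w_t);
    p = 2 makes the mirror map the identity, recovering plain GD.
    """
    n, d = X.shape
    z = np.zeros(d)                    # dual iterate, starts at grad psi(0) = 0
    w = inverse_mirror_map(z, p)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n   # gradient of the squared loss
        z -= lr * grad                 # step in the dual (mirror) space
        w = inverse_mirror_map(z, p)   # map back to the primal space
    return w

rng = np.random.default_rng(0)
n, d = 20, 100                         # over-parameterized: many interpolants
w_star = np.zeros(d)
w_star[:5] = rng.normal(size=5)        # sparse ground truth
X = rng.normal(size=(n, d))
y = X @ w_star

w_md = mirror_descent(X, y, p=1.5)     # biased toward a small l_{1.5}-norm solution
w_gd = mirror_descent(X, y, p=2.0)     # equivalent to plain gradient descent
print("l1 norm  GD:", np.abs(w_gd).sum(), " MD:", np.abs(w_md).sum())
print("train MSE  GD:", np.mean((X @ w_gd - y) ** 2),
      " MD:", np.mean((X @ w_md - y) ** 2))
```

Both runs drive the training error to (near) zero, but the interpolants they select differ: swapping the potential changes which solution the algorithm prefers, which is the lever the paper formalizes for controlling implicit regularization.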
Related papers
- Weakly Convex Regularisers for Inverse Problems: Convergence of Critical Points and Primal-Dual Optimisation [12.455342327482223]
We present a generalised formulation of convergent regularisation in terms of critical points.
We show that this is achieved by a class of weakly convex regularisers.
Applying this theory to learned regularisation, we prove universal approximation for input weakly convex neural networks.
arXiv Detail & Related papers (2024-02-01T22:54:45Z)
- Smoothing the Edges: Smooth Optimization for Sparse Regularization using Hadamard Overparametrization [10.009748368458409]
We present a framework for smooth optimization of explicitly regularized objectives for (structured) sparsity.
Our method enables fully differentiable approximation-free optimization and is thus compatible with the ubiquitous gradient descent paradigm in deep learning.
arXiv Detail & Related papers (2023-07-07T13:06:12Z)
- Joint Graph Learning and Model Fitting in Laplacian Regularized Stratified Models [5.933030735757292]
Laplacian regularized stratified models (LRSM) utilize the explicit or implicit network structure of the sub-problems.
This paper shows the importance and sensitivity of graph weights in LRSM and provably shows that the sensitivity can be arbitrarily large.
We propose a generic approach to jointly learn the graph while fitting the model parameters by solving a single optimization problem.
arXiv Detail & Related papers (2023-05-04T06:06:29Z)
- Towards Principled Disentanglement for Domain Generalization [90.9891372499545]
A fundamental challenge for machine learning models is generalizing to out-of-distribution (OOD) data.
We first formalize the OOD generalization problem as a constrained optimization problem, called Disentanglement-constrained Domain Generalization (DDG).
Based on the transformation, we propose a primal-dual algorithm for joint representation disentanglement and domain generalization.
arXiv Detail & Related papers (2021-11-27T07:36:32Z)
- The Benefits of Implicit Regularization from SGD in Least Squares Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z)
- Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We highlight a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Posterior Differential Regularization with f-divergence for Improving Model Robustness [95.05725916287376]
We focus on methods that regularize the model posterior difference between clean and noisy inputs.
We generalize the posterior differential regularization to the family of $f$-divergences.
Our experiments show that regularizing the posterior differential with $f$-divergence can substantially improve model robustness.
arXiv Detail & Related papers (2020-10-23T19:58:01Z)
- CASTLE: Regularization via Auxiliary Causal Graph Discovery [89.74800176981842]
We introduce Causal Structure Learning (CASTLE) regularization and propose to regularize a neural network by jointly learning the causal relationships between variables.
CASTLE efficiently reconstructs only the features in the causal DAG that have a causal neighbor, whereas reconstruction-based regularizers suboptimally reconstruct all input features.
arXiv Detail & Related papers (2020-09-28T09:49:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.