Mirror Descent Maximizes Generalized Margin and Can Be Implemented
Efficiently
- URL: http://arxiv.org/abs/2205.12808v1
- Date: Wed, 25 May 2022 14:33:13 GMT
- Title: Mirror Descent Maximizes Generalized Margin and Can Be Implemented
Efficiently
- Authors: Haoyuan Sun, Kwangjun Ahn, Christos Thrampoulidis, Navid Azizan
- Abstract summary: We show that $p$-$\textsf{GD}$ can noticeably affect the structure and the generalization performance of the learned models.
We also show that $p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can be used to train deep neural networks.
- Score: 34.438887960077025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driven by the empirical success and wide use of deep neural networks,
understanding the generalization performance of overparameterized models has
become an increasingly popular question. To this end, there has been
substantial effort to characterize the implicit bias of the optimization
algorithms used, such as gradient descent (GD), and the structural properties
of their preferred solutions. This paper answers an open question in this
literature: For the classification setting, what solution does mirror descent
(MD) converge to? Specifically, motivated by its efficient implementation, we
consider the family of mirror descent algorithms with potential function chosen
as the $p$-th power of the $\ell_p$-norm, which is an important generalization
of GD. We call this algorithm $p$-$\textsf{GD}$. For this family, we
characterize the solutions it obtains and show that it converges in direction
to a generalized maximum-margin solution with respect to the $\ell_p$-norm for
linearly separable classification. While the MD update rule is in general
expensive to compute and perhaps not suitable for deep learning,
$p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can be
used to train deep neural networks with virtually no additional computational
overhead. Using comprehensive experiments with both linear and deep neural
network models, we demonstrate that $p$-$\textsf{GD}$ can noticeably affect the
structure and the generalization performance of the learned models.
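To make the "virtually no additional computational overhead" claim concrete, below is a minimal NumPy sketch of one step of this mirror-descent family (not the authors' code). It assumes the potential is $\psi(w) = \tfrac{1}{p}\|w\|_p^p$ (the $\tfrac{1}{p}$ scaling only rescales the step size relative to the $p$-th power of the $\ell_p$-norm named in the abstract), so the mirror map and its inverse act elementwise, which is why the update costs essentially the same as an SGD step. The function name `p_gd_step`, the learning rate, and the toy logistic example are illustrative choices, not from the paper.

```python
import numpy as np

def p_gd_step(w, grad, lr=0.1, p=3.0):
    """One mirror-descent step with potential psi(w) = (1/p) * ||w||_p^p, p > 1.

    The mirror map grad psi(w) = sign(w) * |w|^(p-1) and its inverse are
    applied elementwise, so the update is O(d) and parallelizes like SGD.
    """
    # Map the iterate into the dual (mirror) space.
    z = np.sign(w) * np.abs(w) ** (p - 1)
    # Take the gradient step in the dual space.
    z = z - lr * grad
    # Map back to the primal space: inverse of x -> sign(x) * |x|^(p-1).
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))

# Toy usage: logistic loss on a single linearly separable example.
w = np.zeros(5)
x, y = np.ones(5), 1.0
for _ in range(100):
    margin = y * (w @ x)
    grad = -y * x / (1.0 + np.exp(margin))  # gradient of log(1 + exp(-margin))
    w = p_gd_step(w, grad, lr=0.1, p=3.0)
```

Note that for $p = 2$ the mirror map is the identity and the step reduces to plain gradient descent, consistent with the abstract's framing of $p$-$\textsf{GD}$ as a generalization of GD.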
Related papers
- Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization [71.35604981129838]
Traditional gradient-based bi-level optimization algorithms are ill-suited to meet the demands of large-scale applications.
We introduce $(\text{FG})^2\text{U}$, which achieves an unbiased approximation of the meta gradient for bi-level optimization.
$(\text{FG})^2\text{U}$ is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems.
arXiv Detail & Related papers (2024-06-20T08:21:52Z) - Matrix Completion via Nonsmooth Regularization of Fully Connected Neural Networks [7.349727826230864]
It has been shown that enhanced performance could be attained by using nonlinear estimators such as deep neural networks.
In this paper, we control over-fitting by regularizing the FCNN model in terms of the norm of its intermediate representations.
Our simulations indicate the superiority of the proposed algorithm in comparison with existing linear and nonlinear algorithms.
arXiv Detail & Related papers (2024-03-15T12:00:37Z) - A multiobjective continuation method to compute the regularization path of deep neural networks [1.3654846342364308]
Sparsity is a highly desirable feature in deep neural networks (DNNs), since it ensures numerical efficiency, improves the interpretability of models, and increases robustness.
We present an algorithm that computes the entire sparse front for the above-mentioned objectives in a very efficient manner for high-dimensional models with millions of parameters.
We demonstrate that knowledge of the regularization path allows for a well-generalizing network parametrization.
arXiv Detail & Related papers (2023-08-23T10:08:52Z) - Towards understanding neural collapse in supervised contrastive learning with the information bottleneck method [26.874007846077884]
Neural collapse describes the geometry of the activations in the final layer of a deep neural network when it is trained beyond the point where performance plateaus.
We demonstrate that neural collapse leads to good generalization specifically when it approaches an optimal IB solution of the classification problem.
arXiv Detail & Related papers (2023-05-19T18:41:17Z) - Representation Learning for General-sum Low-rank Markov Games [63.119870889883224]
We study multi-agent general-sum Markov games with nonlinear function approximation.
We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation.
arXiv Detail & Related papers (2022-10-30T22:58:22Z) - Inability of a graph neural network heuristic to outperform greedy
algorithms in solving combinatorial optimization problems like Max-Cut [0.0]
In Nature Machine Intelligence 4, 367 (2022), Schuetz et al. provide a scheme that employs graph neural networks (GNNs) to solve a variety of classical, NP-hard optimization problems.
They describe how the network is trained on sample instances and how the resulting GNN is evaluated using widely adopted techniques to determine its ability to succeed.
However, closer inspection shows that the reported results for this GNN are only minutely better than those for gradient descent and get outperformed by a greedy algorithm.
arXiv Detail & Related papers (2022-10-02T20:50:33Z) - A framework for overparameterized learning [0.0]
An explanation for the success of deep neural networks is a central question in theoretical machine learning.
We propose a framework consisting of a prototype learning problem, which is general enough to cover many popular problems.
We then demonstrate that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem.
arXiv Detail & Related papers (2022-05-26T17:17:46Z) - Path Regularization: A Convexity and Sparsity Inducing Regularization
for Parallel ReLU Networks [75.33431791218302]
We study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape.
We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases.
arXiv Detail & Related papers (2021-10-18T18:00:36Z) - Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
Variational autoencoders (VAE) are a powerful and widely-used class of generative models.
We introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs.
Our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
arXiv Detail & Related papers (2021-01-06T17:36:26Z) - Improving predictions of Bayesian neural nets via local linearization [79.21517734364093]
We argue that the Gauss-Newton approximation should be understood as a local linearization of the underlying Bayesian neural network (BNN).
Because we use this linearized model for posterior inference, we should also predict using this modified model instead of the original one.
We refer to this modified predictive as "GLM predictive" and show that it effectively resolves common underfitting problems of the Laplace approximation.
arXiv Detail & Related papers (2020-08-19T12:35:55Z) - Stochastic Flows and Geometric Optimization on the Orthogonal Group [52.50121190744979]
We present a new class of geometrically-driven optimization algorithms on the orthogonal group $O(d)$.
We show that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, flows and metric learning.
arXiv Detail & Related papers (2020-03-30T15:37:50Z)