Mirror Descent Maximizes Generalized Margin and Can Be Implemented
Efficiently
- URL: http://arxiv.org/abs/2205.12808v1
- Date: Wed, 25 May 2022 14:33:13 GMT
- Title: Mirror Descent Maximizes Generalized Margin and Can Be Implemented
Efficiently
- Authors: Haoyuan Sun, Kwangjun Ahn, Christos Thrampoulidis, Navid Azizan
- Abstract summary: We show that $p$-$\textsf{GD}$ can noticeably affect the structure and the generalization performance of the learned models.
We also show that $p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can be used to train deep neural networks.
- Score: 34.438887960077025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driven by the empirical success and wide use of deep neural networks,
understanding the generalization performance of overparameterized models has
become an increasingly popular question. To this end, there has been
substantial effort to characterize the implicit bias of the optimization
algorithms used, such as gradient descent (GD), and the structural properties
of their preferred solutions. This paper answers an open question in this
literature: For the classification setting, what solution does mirror descent
(MD) converge to? Specifically, motivated by its efficient implementation, we
consider the family of mirror descent algorithms with potential function chosen
as the $p$-th power of the $\ell_p$-norm, which is an important generalization
of GD. We call this algorithm $p$-$\textsf{GD}$. For this family, we
characterize the solutions it obtains and show that it converges in direction
to a generalized maximum-margin solution with respect to the $\ell_p$-norm for
linearly separable classification. While the MD update rule is in general
expensive to compute and perhaps not suitable for deep learning,
$p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can be
used to train deep neural networks with virtually no additional computational
overhead. Using comprehensive experiments with both linear and deep neural
network models, we demonstrate that $p$-$\textsf{GD}$ can noticeably affect the
structure and the generalization performance of the learned models.
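To make the "virtually no additional computational overhead" claim concrete, below is a minimal NumPy sketch of one step of this mirror-descent family (not the authors' code). It assumes the potential is $\psi(w) = \tfrac{1}{p}\|w\|_p^p$ (the $\tfrac{1}{p}$ scaling only rescales the step size relative to the $p$-th power of the $\ell_p$-norm named in the abstract), so the mirror map and its inverse act elementwise, which is why the update costs essentially the same as an SGD step. The function name `p_gd_step`, the learning rate, and the toy logistic example are illustrative choices, not from the paper.

```python
import numpy as np

def p_gd_step(w, grad, lr=0.1, p=3.0):
    """One mirror-descent step with potential psi(w) = (1/p) * ||w||_p^p, p > 1.

    The mirror map grad psi(w) = sign(w) * |w|^(p-1) and its inverse are
    applied elementwise, so the update is O(d) and parallelizes like SGD.
    """
    # Map the iterate into the dual (mirror) space.
    z = np.sign(w) * np.abs(w) ** (p - 1)
    # Take the gradient step in the dual space.
    z = z - lr * grad
    # Map back to the primal space: inverse of x -> sign(x) * |x|^(p-1).
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))

# Toy usage: logistic loss on a single linearly separable example.
w = np.zeros(5)
x, y = np.ones(5), 1.0
for _ in range(100):
    margin = y * (w @ x)
    grad = -y * x / (1.0 + np.exp(margin))  # gradient of log(1 + exp(-margin))
    w = p_gd_step(w, grad, lr=0.1, p=3.0)
```

Note that for $p = 2$ the mirror map is the identity and the step reduces to plain gradient descent, consistent with the abstract's framing of $p$-$\textsf{GD}$ as a generalization of GD.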
Related papers
- Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization [71.35604981129838]
Traditional gradient-based bi-level optimization algorithms are ill-suited to meet the demands of large-scale applications.
We introduce $(\text{FG})^2\text{U}$, which achieves an unbiased approximation of the meta gradient for bi-level optimization.
$(\text{FG})^2\text{U}$ is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems.
arXiv Detail & Related papers (2024-06-20T08:21:52Z) - Matrix Completion via Nonsmooth Regularization of Fully Connected Neural Networks [7.349727826230864]
It has been shown that enhanced performance could be attained by using nonlinear estimators such as deep neural networks.
In this paper, we control over-fitting by regularizing the FCNN model in terms of the norm of its intermediate representations.
Our simulations indicate the superiority of the proposed algorithm in comparison with existing linear and nonlinear algorithms.
arXiv Detail & Related papers (2024-03-15T12:00:37Z) - A multiobjective continuation method to compute the regularization path of deep neural networks [1.3654846342364308]
Sparsity is a highly desirable feature in deep neural networks (DNNs), since it ensures numerical efficiency, improves the interpretability of models, and increases robustness.
We present an algorithm that computes the entire sparse front for the above-mentioned objectives in a very efficient manner for high-dimensional models with millions of parameters.
We demonstrate that knowledge of the regularization path allows for a well-generalizing network parametrization.
arXiv Detail & Related papers (2023-08-23T10:08:52Z) - Towards understanding neural collapse in supervised contrastive learning with the information bottleneck method [26.874007846077884]
Neural collapse describes the geometry of the activations in the final layer of a deep neural network when it is trained beyond the point where performance plateaus.
We demonstrate that neural collapse leads to good generalization specifically when it approaches an optimal IB solution of the classification problem.
arXiv Detail & Related papers (2023-05-19T18:41:17Z) - Representation Learning for General-sum Low-rank Markov Games [63.119870889883224]
We study multi-agent general-sum Markov games with nonlinear function approximation.
We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation.
arXiv Detail & Related papers (2022-10-30T22:58:22Z) - Inability of a graph neural network heuristic to outperform greedy
algorithms in solving combinatorial optimization problems like Max-Cut [0.0]
In Nature Machine Intelligence 4, 367 (2022), Schuetz et al. provide a scheme that employs graph neural networks (GNNs) to solve a variety of classical, NP-hard optimization problems.
They describe how the network is trained on sample instances and how the resulting GNN is evaluated using widely adopted techniques to determine its ability to succeed.
However, closer inspection shows that the reported results for this GNN are only minutely better than those for gradient descent and get outperformed by a greedy algorithm.
arXiv Detail & Related papers (2022-10-02T20:50:33Z) - A framework for overparameterized learning [0.0]
An explanation for the success of deep neural networks is a central question in theoretical machine learning.
We propose a framework consisting of a prototype learning problem, which is general enough to cover many popular problems.
We then demonstrate that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem.
arXiv Detail & Related papers (2022-05-26T17:17:46Z) - Path Regularization: A Convexity and Sparsity Inducing Regularization
for Parallel ReLU Networks [75.33431791218302]
We study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape.
We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases.
arXiv Detail & Related papers (2021-10-18T18:00:36Z) - Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
Variational autoencoders (VAE) are a powerful and widely-used class of generative models.
We introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs.
Our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
arXiv Detail & Related papers (2021-01-06T17:36:26Z) - Improving predictions of Bayesian neural nets via local linearization [79.21517734364093]
We argue that the Gauss-Newton approximation should be understood as a local linearization of the underlying Bayesian neural network (BNN).
Because we use this linearized model for posterior inference, we should also predict using this modified model instead of the original one.
We refer to this modified predictive as "GLM predictive" and show that it effectively resolves common underfitting problems of the Laplace approximation.
arXiv Detail & Related papers (2020-08-19T12:35:55Z) - Stochastic Flows and Geometric Optimization on the Orthogonal Group [52.50121190744979]
We present a new class of geometrically-driven optimization algorithms on the orthogonal group $O(d)$.
We show that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, flows and metric learning.
arXiv Detail & Related papers (2020-03-30T15:37:50Z)