Implicitly Maximizing Margins with the Hinge Loss
- URL: http://arxiv.org/abs/2006.14286v1
- Date: Thu, 25 Jun 2020 10:04:16 GMT
- Title: Implicitly Maximizing Margins with the Hinge Loss
- Authors: Justin Lizama
- Abstract summary: We show that for a linear classifier on linearly separable data with fixed step size, the margin of this modified hinge loss converges to the $ell$ max-margin at the rate of $mathcalO( 1/t )$.
empirical results suggest that this increased speed carries over to ReLU networks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A new loss function is proposed for neural networks on classification tasks
which extends the hinge loss by assigning gradients to its critical points. We
will show that for a linear classifier on linearly separable data with fixed
step size, the margin of this modified hinge loss converges to the $\ell_2$
max-margin at the rate of $\mathcal{O}( 1/t )$. This rate is fast when compared
with the $\mathcal{O}(1/\log t)$ rate of exponential losses such as the
logistic loss. Furthermore, empirical results suggest that this increased
convergence speed carries over to ReLU networks.
Related papers
- Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits this initial oscillatory phase rapidly -- in $mathcalO(eta)$ steps -- and subsequently achieves an $tildemathcalO (1 / (eta t) )$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $tildemathcalO (1/T2)$ with an aggressive stepsize
arXiv Detail & Related papers (2024-02-24T23:10:28Z) - Gradient Descent Converges Linearly for Logistic Regression on Separable
Data [17.60502131429094]
We show that running gradient descent with variable learning rate guarantees loss $f(x) leq 1.1 cdot f(x*) + epsilon$ the logistic regression objective.
We also apply our ideas to sparse logistic regression, where they lead to an exponential improvement of the sparsity-error tradeoff.
arXiv Detail & Related papers (2023-06-26T02:15:26Z) - Nonparametric regression with modified ReLU networks [77.34726150561087]
We consider regression estimation with modified ReLU neural networks in which network weight matrices are first modified by a function $alpha$ before being multiplied by input vectors.
arXiv Detail & Related papers (2022-07-17T21:46:06Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over- parameterization, where the width is $tildemathcalO(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Fast Margin Maximization via Dual Acceleration [52.62944011696364]
We present and analyze a momentum-based method for training linear classifiers with an exponentially-tailed loss.
This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual.
arXiv Detail & Related papers (2021-07-01T16:36:39Z) - Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinitehorizon discounted decision process (MDP)
We show that given any estimate for the optimal quality function $Q*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z) - When does gradient descent with logistic loss find interpolating
two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough.
When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z) - The Implicit Bias of Gradient Descent on Separable Data [44.98410310356165]
We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero.
arXiv Detail & Related papers (2017-10-27T21:47:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.