Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms
- URL: http://arxiv.org/abs/2510.23012v1
- Date: Mon, 27 Oct 2025 05:16:04 GMT
- Title: Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms
- Authors: Pravin Nair,
- Abstract summary: We prove that the softmax function is contractive with the Lipschitz constant $1/2$.<n>To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity.
- Score: 5.600982367387832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees of learning models and convergence analysis of optimization algorithms typically consider the softmax operator to have a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with the Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, and for $p \in (1,\infty)$, the constant remains strictly below $1/2$ and the supremum $1/2$ is achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
Related papers
- Revisiting Inexact Fixed-Point Iterations for Min-Max Problems: Stochasticity and Structured Nonconvexity [18.427215139020632]
We show that $1L 1$ can be used to improve some state-of-the-art problems even for a multilevel Carlo iteration.
We provide an analysis for inexact Halperness estimators for $1L 1$ when the only hold with respect to a solution is a new $1L 1$ theory.
arXiv Detail & Related papers (2024-02-07T18:22:41Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $tau$.
We show the minimax CVaR regret rate is $Omega(sqrttau-1AK)$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $widetilde O(tau-1sqrtSAK)$ under a continuity assumption and in general attains a near
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound $O(dmax1,p/2log2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Nonconvex-Nonconcave Min-Max Optimization with a Small Maximization
Domain [11.562923882714093]
We study the problem of finding approximate first-order stationary points in optimization problems of the form $min_x in max_y in Y f(x,y)
Our approach relies upon replacing the function $f(x,cdot)$ with its $kth order Taylor approximation (in $y$) and finding a near-stationary point in $Y$.
arXiv Detail & Related papers (2021-10-08T07:46:18Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors the optimal excess population loss of any $(varepsilon,delta)$-differently private is $sqrtlog(d)/n + sqrtd/varepsilon n.$
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $sqrtlog(d)/n + (log(d)/varepsilon n)2/3.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Instance-Dependent Bounds for Zeroth-order Lipschitz Optimization with
Error Certificates [0.0]
We study the problem of zeroth-order (black-box) optimization of a Lipschitz function $f$ defined on a compact subset $mathcal X$ of $mathbb Rd$.
We characterize the optimal number of evaluations of any Lipschitz function $f$ to find and certify an approximater of $f$ at accuracy $varepsilon$.
arXiv Detail & Related papers (2021-02-03T09:51:03Z) - The Complexity of Constrained Min-Max Optimization [29.57458485068705]
We show that an approximate local point large enough min-max is guaranteed to exist.
More importantly, we show an approximate fixed gradient Descent/Ascent approximation complete.
Our result is the first to show an exponential approximation of two fundamental optimization problems.
arXiv Detail & Related papers (2020-09-21T05:54:12Z) - Second-Order Information in Non-Convex Stochastic Optimization: Power
and Limitations [54.42518331209581]
We find an algorithm which finds.
epsilon$-approximate stationary point (with $|nabla F(x)|le epsilon$) using.
$(epsilon,gamma)$surimate random random points.
Our lower bounds here are novel even in the noiseless case.
arXiv Detail & Related papers (2020-06-24T04:41:43Z) - Greedy Adversarial Equilibrium: An Efficient Alternative to
Nonconvex-Nonconcave Min-Max Optimization [28.431572772564518]
We show that Lipschitzitz's $varepsilon$-greedy adversarial model converges from any starting point to a $max_z f(x, z)$.
We also show that Lipschitz's $nabla_y f(x,y)$ is in the dimension $d$, $1/varepsilon$, and the bounds on $nabla2_y f(x,y)$ are $nabla2_y.
arXiv Detail & Related papers (2020-06-22T16:03:41Z) - Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only emphbatch assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.