Training Transformers with Enforced Lipschitz Constants
- URL: http://arxiv.org/abs/2507.13338v1
- Date: Thu, 17 Jul 2025 17:55:00 GMT
- Title: Training Transformers with Enforced Lipschitz Constants
- Authors: Laker Newhouse, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, Phillip Isola
- Abstract summary: We train neural networks with Lipschitz bounds enforced throughout training. We find that switching from AdamW to Muon improves standard methods. Inspired by Muon's update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff.
- Score: 25.42378506132261
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods -- weight decay and spectral normalization -- allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon's update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our 2-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 10-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^264. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.
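The abstract names two ingredients that are easy to make concrete: spectral normalization, which keeps a weight matrix's largest singular value under a cap, and Muon, whose update is orthogonalized so that its spectral norm is fixed. The sketch below shows generic versions of both tools (power iteration for the spectral norm, a cubic Newton-Schulz iteration for orthogonalization); it is an illustration under assumed function names and hyperparameters, not the authors' exact constraint method.

```python
# Illustrative sketch only: generic norm-constraint tools of the kind the abstract
# refers to, not the paper's exact method. Hyperparameters are assumptions.
import torch

def spectral_cap(W: torch.Tensor, max_sigma: float = 1.0, iters: int = 10) -> torch.Tensor:
    """Rescale W so its largest singular value is at most max_sigma (spectral normalization)."""
    v = torch.randn(W.shape[1])
    for _ in range(iters):                        # power iteration on W^T W
        v = W.t() @ (W @ v)
        v = v / v.norm()
    sigma = (W @ v).norm()                        # estimate of the top singular value
    return W * min(1.0, max_sigma / (sigma.item() + 1e-12))  # shrink only, never inflate

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Push the singular values of G toward 1 (a Muon-style fixed-spectral-norm update)."""
    X = G / (G.norm() + 1e-12)                    # Frobenius normalization keeps sigma_max <= 1
    for _ in range(steps):                        # cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X
        X = 1.5 * X - 0.5 * X @ X.t() @ X
    return X

if __name__ == "__main__":
    W = torch.randn(256, 256)
    print(torch.linalg.matrix_norm(spectral_cap(W), ord=2))                 # approximately max_sigma
    print(torch.linalg.matrix_norm(newton_schulz_orthogonalize(W), ord=2))  # close to 1
```

In a constrained training loop, a projection like spectral_cap would be applied to each weight matrix after every optimizer step, so a layer-wise norm bound holds throughout training rather than only at initialization.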
Related papers
- LipShiFT: A Certifiably Robust Shift-based Vision Transformer [46.7028906678548]
Lipschitz-based margin training acts as a strong regularizer while restricting weights in successive layers of the model.
We provide an upper bound estimate for the Lipschitz constants of this model using the $l$ norm on common image classification datasets.
arXiv Detail & Related papers (2025-03-18T21:38:18Z)
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show how to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources.
The resulting hybrid model achieves performance comparable to the original Transformer in chat benchmarks.
We also introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models.
arXiv Detail & Related papers (2024-08-27T17:56:11Z)
- Certified Robust Models with Slack Control and Large Lipschitz Constants [102.69689641398227]
We propose a Calibrated Lipschitz-Margin Loss (CLL) that addresses two problems.
Firstly, commonly used margin losses do not adjust the penalties to the shrinking output distribution.
Secondly, minimization of the Lipschitz constant $K$ can lead to overly smooth decision functions.
Our CLL addresses these issues by explicitly calibrating the loss w.r.t. margin and Lipschitz constant.
arXiv Detail & Related papers (2023-09-12T12:23:49Z)
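The interaction between margins and the Lipschitz constant $K$ that CLL calibrates can be made concrete with the standard certificate behind Lipschitz-margin training: if every logit is $K$-Lipschitz in the input under the $l_2$ norm, each logit moves by at most $K\epsilon$ inside an $l_2$ ball of radius $\epsilon$, so a top-versus-runner-up margin larger than $2K\epsilon$ guarantees the prediction cannot flip. The sketch below illustrates that generic certificate; it is not the CLL loss itself, and the function name and toy numbers are assumptions.

```python
# Generic Lipschitz-margin certificate (illustration only, not the CLL loss).
# Assumption: K upper-bounds the Lipschitz constant of every logit w.r.t. the l2
# input norm, so each logit can change by at most K * eps within an l2 ball of radius eps.
import torch

def certified_radius(logits: torch.Tensor, K: float) -> torch.Tensor:
    """Largest l2 radius at which each prediction is certified.

    A margin m between the top logit and the runner-up survives any perturbation
    of norm eps as long as m > 2 * K * eps, i.e. for all eps < m / (2 * K).
    """
    top2 = logits.topk(2, dim=-1).values        # (batch, 2): top logit and runner-up
    margin = top2[..., 0] - top2[..., 1]
    return margin.clamp(min=0.0) / (2.0 * K)

if __name__ == "__main__":
    toy_logits = torch.tensor([[3.2, 0.5, -1.0],   # two examples, three classes
                               [1.1, 0.9,  0.0]])
    print(certified_radius(toy_logits, K=5.0))     # larger margin -> larger certified radius
```

A margin-based loss in this spirit penalizes examples whose margin falls below $2K\epsilon$ for a target radius $\epsilon$.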
- LipsFormer: Introducing Lipschitz Continuity to Vision Transformers [15.568629066375971]
We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability for Transformer-based models.
Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need of careful learning rate tuning.
LipsFormer-CSwin-Tiny, based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters.
arXiv Detail & Related papers (2023-04-19T17:59:39Z)
- CertViT: Certified Robustness of Pre-Trained Vision Transformers [11.880271015435582]
Lipschitz-bounded neural networks are certifiably robust and offer a good trade-off between clean and certified accuracy.
Existing Lipschitz bounding methods train from scratch and are limited to moderately sized networks.
We show that CertViT networks have better certified accuracy than state-of-the-art Lipschitz-trained networks.
arXiv Detail & Related papers (2023-02-01T06:09:19Z)
- Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds [99.23098204458336]
Certified robustness is a desirable property for deep neural networks in safety-critical applications.
We show that our method consistently outperforms state-of-the-art methods on the MNIST and TinyImageNet datasets.
arXiv Detail & Related papers (2021-11-02T06:44:10Z)
- Robust Implicit Networks via Non-Euclidean Contractions [63.91638306025768]
Implicit neural networks offer improved accuracy and a significant reduction in memory consumption, but they can suffer from ill-posedness and convergence instability.
This paper provides a new framework to design well-posed and robust implicit neural networks.
arXiv Detail & Related papers (2021-06-06T18:05:02Z)
- On Lipschitz Regularization of Convolutional Layers using Toeplitz Matrix Theory [77.18089185140767]
Lipschitz regularity is established as a key property of modern deep learning, yet computing the exact value of the Lipschitz constant of a neural network is known to be NP-hard.
We introduce a new upper bound for convolutional layers that is both tight and easy to compute.
arXiv Detail & Related papers (2020-06-15T13:23:34Z)
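Since computing the exact Lipschitz constant is NP-hard, the standard cheap alternative for a chain of linear layers and 1-Lipschitz activations is the compositional upper bound given by the product of the layers' spectral norms; per-layer results such as the Toeplitz-based bound above tighten the individual factors. Below is a minimal sketch of that product bound on an assumed toy MLP; it is not code from any of the listed papers.

```python
# Product-of-spectral-norms Lipschitz upper bound (illustrative sketch).
# Valid for a sequential chain of Linear layers and 1-Lipschitz activations
# such as ReLU; the toy MLP below is an assumption for demonstration.
import torch
import torch.nn as nn

def product_spectral_bound(model: nn.Sequential) -> float:
    """Upper-bound the l2 Lipschitz constant by multiplying per-layer spectral norms."""
    bound = 1.0
    for layer in model:
        if isinstance(layer, nn.Linear):
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
        # 1-Lipschitz activations (ReLU, tanh, ...) contribute a factor of 1
    return bound

if __name__ == "__main__":
    mlp = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
    print(f"Lipschitz upper bound: {product_spectral_bound(mlp):.2f}")
```

This bound is loose because it ignores how the layers' directions compose, which is one reason the works above pursue either per-layer constraints enforced during training or tighter per-layer estimates.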