LipsFormer: Introducing Lipschitz Continuity to Vision Transformers
- URL: http://arxiv.org/abs/2304.09856v1
- Date: Wed, 19 Apr 2023 17:59:39 GMT
- Title: LipsFormer: Introducing Lipschitz Continuity to Vision Transformers
- Authors: Xianbiao Qi, Jianan Wang, Yihao Chen, Yukai Shi, Lei Zhang
- Abstract summary: We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability for Transformer-based models.
Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need for careful learning rate tuning.
LipsFormer-CSwin-Tiny, based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters.
- Score: 15.568629066375971
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a Lipschitz continuous Transformer, called LipsFormer, to pursue
training stability both theoretically and empirically for Transformer-based
models. In contrast to previous practical tricks that address training
instability by learning rate warmup, layer normalization, attention
formulation, and weight initialization, we show that Lipschitz continuity is a
more essential property to ensure training stability. In LipsFormer, we replace
unstable Transformer component modules with Lipschitz continuous counterparts:
CenterNorm instead of LayerNorm, spectral initialization instead of Xavier
initialization, scaled cosine similarity attention instead of dot-product
attention, and weighted residual shortcut. We prove that these introduced
modules are Lipschitz continuous and derive an upper bound on the Lipschitz
constant of LipsFormer. Our experiments show that LipsFormer allows stable
training of deep Transformer architectures without the need for careful learning
rate tuning such as warmup, yielding faster convergence and better
generalization. As a result, on the ImageNet-1K dataset, LipsFormer-Swin-Tiny,
based on the Swin Transformer and trained for 300 epochs, obtains a top-1
accuracy of 82.7% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny,
based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5% with
4.7G FLOPs and 24M parameters. The code will be released at
https://github.com/IDEA-Research/LipsFormer.
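For intuition, Lipschitz constants compose multiplicatively, $\mathrm{Lip}(f \circ g) \le \mathrm{Lip}(f)\cdot\mathrm{Lip}(g)$, so bounding each sub-module bounds the whole network. The PyTorch-style sketch below is a minimal, hedged reading of the substitutions named in the abstract, not the paper's exact formulation: the residual weight initialization, the attention temperature, and the CenterNorm and spectral-initialization details are illustrative assumptions; the official implementation at https://github.com/IDEA-Research/LipsFormer is authoritative.

```python
# Hedged sketch of the Lipschitz-oriented substitutions described in the abstract.
# Names and hyperparameters here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CenterNorm(nn.Module):
    """LayerNorm replacement: centers features and applies an affine map,
    omitting the variance division (the non-Lipschitz part of LayerNorm)."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        x = x - x.mean(dim=-1, keepdim=True)   # centering only, no 1/std
        return self.gamma * x + self.beta


class ScaledCosineAttention(nn.Module):
    """Dot-product attention replaced by scaled cosine-similarity attention:
    queries and keys are L2-normalized, so attention logits are bounded."""
    def __init__(self, dim, num_heads=8, init_scale=10.0):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # learnable temperature for the bounded cosine logits (assumed form)
        self.tau = nn.Parameter(torch.full((num_heads, 1, 1), init_scale))

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N, d)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.tau     # cosine similarity * scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class LipsBlock(nn.Module):
    """Transformer block with weighted residual shortcuts: each residual branch
    is scaled by a small learnable weight alpha, keeping the block close to the
    identity early in training (initial value assumed)."""
    def __init__(self, dim, num_heads=8, alpha_init=0.1):
        super().__init__()
        self.norm1 = CenterNorm(dim)
        self.attn = ScaledCosineAttention(dim, num_heads)
        self.norm2 = CenterNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.alpha1 = nn.Parameter(torch.full((dim,), alpha_init))
        self.alpha2 = nn.Parameter(torch.full((dim,), alpha_init))

    def forward(self, x):
        x = x + self.alpha1 * self.attn(self.norm1(x))
        x = x + self.alpha2 * self.mlp(self.norm2(x))
        return x


def spectral_init_(linear):
    """Xavier init rescaled by the spectral norm so the layer starts 1-Lipschitz."""
    nn.init.xavier_uniform_(linear.weight)
    with torch.no_grad():
        linear.weight /= torch.linalg.matrix_norm(linear.weight, ord=2)
        if linear.bias is not None:
            linear.bias.zero_()


if __name__ == "__main__":
    block = LipsBlock(dim=96)
    for m in block.modules():
        if isinstance(m, nn.Linear):
            spectral_init_(m)
    tokens = torch.randn(2, 49, 96)    # (batch, tokens, channels)
    print(block(tokens).shape)         # torch.Size([2, 49, 96])
```

In this sketch, spectral initialization is rendered as Xavier initialization rescaled by the layer's largest singular value, so every linear map starts out 1-Lipschitz, and the weighted residual shortcuts start near the identity, which is consistent with the warmup-free training reported above.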
Related papers
- Training Transformers with Enforced Lipschitz Constants [25.42378506132261]
We train neural networks with Lipschitz bounds enforced throughout training.
We find that switching from AdamW to Muon improves standard methods.
Inspired by Muon's update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff.
arXiv Detail & Related papers (2025-07-17T17:55:00Z)
- LipShiFT: A Certifiably Robust Shift-based Vision Transformer [46.7028906678548]
Lipschitz-based margin training acts as a strong regularizer while restricting weights in successive layers of the model.
We provide an upper bound estimate for the Lipschitz constants of this model using the $l$ norm on common image classification datasets.
arXiv Detail & Related papers (2025-03-18T21:38:18Z)
- DP-SGD Without Clipping: The Lipschitz Neural Network Way [5.922390405022253]
State-of-the-art approaches for training Differentially Private (DP) Deep Neural Networks (DNNs) rely on per-sample gradient clipping.
By bounding the Lipschitz constant of each layer with respect to its parameters, we prove that we can train these networks with privacy guarantees.
Our analysis not only allows the computation of the aforementioned sensitivities at scale, but also provides guidance on how to maximize the gradient-to-noise ratio for fixed privacy guarantees.
arXiv Detail & Related papers (2023-05-25T16:05:46Z)
- Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram Iteration [122.51142131506639]
We introduce a precise, fast, and differentiable upper bound for the spectral norm of convolutional layers using circulant matrix theory.
We show through a comprehensive set of experiments that our approach outperforms other state-of-the-art methods in terms of precision, computational cost, and scalability.
It proves highly effective for the Lipschitz regularization of convolutional neural networks, with competitive results against concurrent approaches.
arXiv Detail & Related papers (2023-05-25T15:32:21Z)
- CertViT: Certified Robustness of Pre-Trained Vision Transformers [11.880271015435582]
Lipschitz bounded neural networks are certifiably robust and have a good trade-off between clean and certified accuracy.
Existing Lipschitz bounding methods train from scratch and are limited to moderately sized networks.
We show that CertViT networks have better certified accuracy than state-of-the-art Lipschitz trained networks.
arXiv Detail & Related papers (2023-02-01T06:09:19Z)
- Improved techniques for deterministic l2 robustness [63.34032156196848]
Training convolutional neural networks (CNNs) with a strict 1-Lipschitz constraint under the $l_2$ norm is useful for adversarial robustness, interpretable gradients and stable training.
We introduce a procedure to certify robustness of 1-Lipschitz CNNs by replacing the last linear layer with a 1-hidden-layer MLP.
We significantly advance the state-of-the-art for standard and provable robust accuracies on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2022-11-15T19:10:12Z)
- Chordal Sparsity for Lipschitz Constant Estimation of Deep Neural Networks [77.82638674792292]
Lipschitz constants of neural networks allow for guarantees of robustness in image classification, safety in controller design, and generalizability beyond the training data.
As calculating Lipschitz constants is NP-hard, techniques for estimating Lipschitz constants must navigate the trade-off between scalability and accuracy.
In this work, we significantly push the scalability frontier of a semidefinite programming technique known as LipSDP while achieving zero accuracy loss.
arXiv Detail & Related papers (2022-04-02T11:57:52Z)
- Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds [99.23098204458336]
Certified robustness is a desirable property for deep neural networks in safety-critical applications.
We show that our method consistently outperforms state-of-the-art methods on the MNIST and TinyImageNet datasets.
arXiv Detail & Related papers (2021-11-02T06:44:10Z)
- Robust Implicit Networks via Non-Euclidean Contractions [63.91638306025768]
Implicit neural networks show improved accuracy and significant reduction in memory consumption.
However, they can suffer from ill-posedness and convergence instability.
This paper provides a new framework to design well-posed and robust implicit neural networks.
arXiv Detail & Related papers (2021-06-06T18:05:02Z)
- Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention [98.52189797347354]
We introduce multi-scale processing into the spatial feature extraction for lip-reading.
We merge information in all time steps of the sequence by utilizing self-attention.
Our proposed model has achieved 86.83% accuracy, yielding 1.53% absolute improvement over the current state-of-the-art.
arXiv Detail & Related papers (2020-12-28T16:55:51Z)
- On Lipschitz Regularization of Convolutional Layers using Toeplitz Matrix Theory [77.18089185140767]
Lipschitz regularity is established as a key property of modern deep learning.
However, computing the exact value of the Lipschitz constant of a neural network is known to be NP-hard.
We introduce a new upper bound for convolutional layers that is both tight and easy to compute.
arXiv Detail & Related papers (2020-06-15T13:23:34Z)
- The Lipschitz Constant of Self-Attention [27.61634862685452]
Lipschitz constants of neural networks have been explored in various contexts in deep learning.
We investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling.
arXiv Detail & Related papers (2020-06-08T16:08:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.