Are Transformers More Robust? Towards Exact Robustness Verification for
Transformers
- URL: http://arxiv.org/abs/2202.03932v4
- Date: Fri, 19 May 2023 10:54:49 GMT
- Title: Are Transformers More Robust? Towards Exact Robustness Verification for
Transformers
- Authors: Brian Hsuan-Cheng Liao, Chih-Hong Cheng, Hasan Esen, Alois Knoll
- Abstract summary: We study the robustness problem of Transformers, a key characteristic as low robustness may cause safety concerns.
Specifically, we focus on Sparsemax-based Transformers and reduce the finding of their maximum robustness to a Mixed Integer Quadratically Constrained Programming (MIQCP) problem.
We then conduct experiments using the application of Lane Departure Warning to compare the robustness of Sparsemax-based Transformers against that of the more conventional Multi-Layer-Perceptron (MLP) NNs.
- Score: 3.2259574483835673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As an emerging type of Neural Networks (NNs), Transformers are used in many
domains ranging from Natural Language Processing to Autonomous Driving. In this
paper, we study the robustness problem of Transformers, a key characteristic, as
low robustness may cause safety concerns. Specifically, we focus on
Sparsemax-based Transformers and reduce the finding of their maximum robustness
to a Mixed Integer Quadratically Constrained Programming (MIQCP) problem. We
also design two pre-processing heuristics that can be embedded in the MIQCP
encoding and substantially accelerate its solving. We then conduct experiments
using the application of Lane Departure Warning to compare the robustness of
Sparsemax-based Transformers against that of the more conventional
Multi-Layer-Perceptron (MLP) NNs. To our surprise, Transformers are not
necessarily more robust, leading to profound considerations in selecting
appropriate NN architectures for safety-critical domain applications.
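Since the reduction above rests on the Sparsemax activation, a brief illustration may help: Sparsemax is the Euclidean projection of the attention scores onto the probability simplex and admits a simple closed form. The NumPy sketch below only illustrates that standard operation (Martins and Astudillo, 2016); it is not the paper's MIQCP encoding.

```python
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Euclidean projection of score vector z onto the probability simplex:
    sparsemax(z) = argmin_p ||p - z||^2  s.t.  p >= 0, sum(p) = 1.
    Closed-form solution via sorting (Martins & Astudillo, 2016)."""
    z_sorted = np.sort(z)[::-1]                # scores in descending order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    # Support size: largest k with 1 + k * z_(k) > sum of the top-k scores
    k_star = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_star - 1] - 1.0) / k_star  # threshold shared by the support
    return np.maximum(z - tau, 0.0)            # scores below tau become exactly zero

# Example: unlike softmax, sparsemax assigns exact zeros to low scores.
print(sparsemax(np.array([1.0, 0.8, 0.1])))    # -> [0.6, 0.4, 0.0]
```

Because Sparsemax is itself the solution of a small quadratic program, its input-output behavior is piecewise linear, which is plausibly what allows the verification problem to be written with mixed-integer linear and quadratic constraints; the exact encoding and the two acceleration heuristics are given in the paper.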
Related papers
- Transformers are Efficient Compilers, Provably [11.459397066286822]
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks.
In this paper, we take the first steps towards a formal investigation of using transformers as compilers from an expressive power perspective.
We introduce a representative programming language, Mini-Husky, which encapsulates key features of modern C-like languages.
arXiv Detail & Related papers (2024-10-07T20:31:13Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Do Efficient Transformers Really Save Computation? [32.919672616480135]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer.
Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size.
We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z)
- On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
- Neural Architecture Search on Efficient Transformers and Beyond [23.118556295894376]
We propose a new framework to find optimal architectures for efficient Transformers with the neural architecture search (NAS) technique.
We observe that the optimal architecture of the efficient Transformer requires less computation than that of the standard Transformer.
Our searched architecture maintains comparable accuracy to the standard Transformer with notably improved computational efficiency.
arXiv Detail & Related papers (2022-07-28T08:41:41Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which overcomes this limitation.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- Transformer Acceleration with Dynamic Sparse Attention [20.758709319088865]
We propose the Dynamic Sparse Attention (DSA) that can efficiently exploit the dynamic sparsity in the attention of Transformers.
Our approach can achieve better trade-offs between accuracy and model complexity.
arXiv Detail & Related papers (2021-10-21T17:31:57Z)
- Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute.
arXiv Detail & Related papers (2021-10-16T23:43:24Z)
- Regularizing Transformers With Deep Probabilistic Layers [62.997667081978825]
In this work, we demonstrate how the inclusion of deep generative models within BERT can bring more versatile models.
We prove its effectiveness not only in Transformers but also in the most relevant encoder-decoder based LM, seq2seq with and without attention.
arXiv Detail & Related papers (2021-08-23T10:17:02Z)
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales and share parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
- Robustness Verification for Transformers [165.25112192811764]
We develop the first robustness verification algorithm for Transformers.
The certified robustness bounds computed by our method are significantly tighter than those obtained by naive Interval Bound Propagation (IBP); a brief IBP sketch follows below.
These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.
arXiv Detail & Related papers (2020-02-16T17:16:31Z)
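As referenced in the last entry above, the baseline it improves on is naive Interval Bound Propagation (IBP). The following is a minimal, generic IBP sketch for a single linear layer followed by ReLU, written in NumPy; it is background for that comparison only and does not reproduce the verification algorithm of either paper.

```python
import numpy as np

def ibp_linear_relu(W: np.ndarray, b: np.ndarray,
                    lower: np.ndarray, upper: np.ndarray):
    """Propagate an axis-aligned input box [lower, upper] through
    y = relu(W @ x + b) and return elementwise output bounds."""
    W_pos = np.maximum(W, 0.0)     # positive weights pick up the upper input bound
    W_neg = np.minimum(W, 0.0)     # negative weights pick up the lower input bound
    out_lower = W_pos @ lower + W_neg @ upper + b
    out_upper = W_pos @ upper + W_neg @ lower + b
    return np.maximum(out_lower, 0.0), np.maximum(out_upper, 0.0)

# Toy usage: output bounds for inputs within an L_inf ball of radius 0.1 around x.
W = np.array([[1.0, -2.0], [0.5, 0.5]])
b = np.array([0.0, -0.1])
x = np.array([0.3, 0.7])
eps = 0.1
lo, hi = ibp_linear_relu(W, b, x - eps, x + eps)
print(lo, hi)   # certified output ranges; the looseness compounds with depth
```

Box bounds like these are cheap but grow loose as they pass through attention layers, which is why certified relaxation methods, or exact encodings such as the MIQCP formulation in the main paper, can give much tighter answers.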