LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order
- URL: http://arxiv.org/abs/2407.04513v2
- Date: Fri, 06 Dec 2024 14:20:26 GMT
- Title: LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order
- Authors: Matthias Freiberger, Peter Kun, Anders Sundnes Løvlie, Sebastian Risi
- Abstract summary: We show that vision transformers can adapt to arbitrary layer execution orders at test time.
Our analysis shows that layers learn to contribute differently based on their position in the network.
- Score: 10.362659730151591
- License:
- Abstract: Due to their architecture and how they are trained, artificial neural networks are typically not robust toward pruning or shuffling layers at test time. However, such properties would be desirable for different applications, such as distributed neural network architectures where the order of execution cannot be guaranteed or parts of the network can fail during inference. In this work, we address these issues through a number of training approaches for vision transformers whose most important component is randomizing the execution order of attention modules at training time. With our proposed approaches, vision transformers are capable of adapting to arbitrary layer execution orders at test time, assuming one tolerates a reduction (about 20%) in accuracy at the same model size. We analyse the feature representations of our trained models as well as how each layer contributes to the model's prediction based on its position during inference. Our analysis shows that layers learn to contribute differently based on their position in the network. Finally, we layer-prune our models at test time and find that their performance declines gracefully. Code available at https://github.com/matfrei/layershuffle.
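To make the core training trick concrete, here is a minimal PyTorch-style sketch of the idea described in the abstract: the execution order of the encoder blocks is re-sampled on every training forward pass. The class name, layer sizes, and use of `nn.TransformerEncoderLayer` are illustrative assumptions, not the authors' implementation; the official code is at the repository linked above.

```python
# Minimal sketch (assumed implementation, not the authors' code): a ViT-style
# encoder whose block order is shuffled on every training forward pass, so each
# block must learn to contribute regardless of its position in the network.
import random
import torch
import torch.nn as nn

class ShuffledEncoder(nn.Module):
    def __init__(self, num_layers=12, dim=384, heads=6, shuffle=True):
        super().__init__()
        self.shuffle = shuffle
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, x):
        order = list(range(len(self.blocks)))
        if self.shuffle and self.training:
            random.shuffle(order)      # new random execution order per batch
        # At test time one could also drop indices from `order` to layer-prune;
        # the paper reports that accuracy then degrades gracefully.
        for i in order:
            x = self.blocks[i](x)
        return x

# Usage: token embeddings of shape (batch, sequence, dim)
model = ShuffledEncoder()
tokens = torch.randn(2, 197, 384)
out = model(tokens)                    # (2, 197, 384)
```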
Related papers
- How Do Training Methods Influence the Utilization of Vision Models? [23.41975772383921]
Not all learnable parameters contribute equally to a neural network's decision function.
We revisit earlier studies that examined how architecture and task complexity influence this phenomenon.
Our findings reveal that the training method strongly influences which layers become critical to the decision function for a given task.
arXiv Detail & Related papers (2024-10-18T13:54:46Z) - Diffused Redundancy in Pre-trained Representations [98.55546694886819]
We take a closer look at how features are encoded in pre-trained representations.
We find that learned representations in a given layer exhibit a degree of diffuse redundancy.
Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
arXiv Detail & Related papers (2023-05-31T21:00:50Z) - Boosted Dynamic Neural Networks [53.559833501288146]
A typical EDNN has multiple prediction heads at different layers of the network backbone.
To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data.
Treating training and testing inputs differently at the two phases causes a mismatch between the training and testing data distributions.
We formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively.
arXiv Detail & Related papers (2022-11-30T04:23:12Z) - Stochastic Layers in Vision Transformers [85.38733795180497]
We introduce fully stochastic layers in vision transformers, without causing any severe drop in performance.
The additional stochasticity boosts the robustness of visual features and strengthens privacy.
We use our features for three different applications, namely, adversarial robustness, network calibration, and feature privacy.
arXiv Detail & Related papers (2021-12-30T16:07:59Z) - Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model that performs better than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z) - Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency [31.572652956170252]
Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance.
We experimentally achieve 7.8X parameter reduction, 41.9% training speedup and 37.7% inference speedup while maintaining comparable performance with conventional BERT-like self-supervised methods.
arXiv Detail & Related papers (2021-04-08T08:21:59Z) - Auto-tuning of Deep Neural Networks by Conflicting Layer Removal [0.0]
We introduce a novel methodology to identify layers that decrease the test accuracy of trained models.
Conflicting layers are detected as early as the beginning of training.
We show that around 60% of the layers of trained residual networks can be completely removed from the architecture.
arXiv Detail & Related papers (2021-03-07T11:51:55Z) - IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z) - Bayesian Nested Neural Networks for Uncertainty Calibration and Adaptive Compression [40.35734017517066]
Nested networks or slimmable networks are neural networks whose architectures can be adjusted instantly during testing time.
Recent studies have focused on a "nested dropout" layer, which is able to order the nodes of a layer by importance during training.
arXiv Detail & Related papers (2021-01-27T12:34:58Z) - Reusing Trained Layers of Convolutional Neural Networks to Shorten Hyperparameters Tuning Time [1.160208922584163]
This paper proposes reusing the weights of hidden (convolutional) layers across different trainings to shorten the hyperparameter tuning process.
The experiments compare the training time and the validation loss when reusing and not reusing convolutional layers.
They confirm that this strategy reduces training time while even increasing the accuracy of the resulting neural network.
arXiv Detail & Related papers (2020-06-16T11:39:39Z) - Fitting the Search Space of Weight-sharing NAS with Graph Convolutional Networks [100.14670789581811]
We train a graph convolutional network to fit the performance of sampled sub-networks.
With this strategy, we achieve a higher rank correlation coefficient in the selected set of candidates.
arXiv Detail & Related papers (2020-04-17T19:12:39Z)