IOT: Instance-wise Layer Reordering for Transformer Structures
- URL: http://arxiv.org/abs/2103.03457v1
- Date: Fri, 5 Mar 2021 03:44:42 GMT
- Title: IOT: Instance-wise Layer Reordering for Transformer Structures
- Authors: Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou,
Houqiang Li, Tie-Yan Liu
- Abstract summary: We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
- Score: 173.39918590438245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With sequentially stacked self-attention, (optional) encoder-decoder
attention, and feed-forward layers, the Transformer has achieved great success in
natural language processing (NLP), and many variants have been proposed. Currently,
almost all these models assume that the layer order is fixed and kept the same
across data samples. We observe that different data samples actually favor
different orders of the layers. Based on this observation, in this work, we
break the assumption of the fixed layer order in the Transformer and introduce
instance-wise layer reordering into the model structure. Our Instance-wise
Ordered Transformer (IOT) can model different functions through reordered layers,
enabling each sample to select the order that yields the best performance while
keeping the number of parameters almost unchanged. To achieve this, we introduce
a lightweight predictor with negligible parameter and inference cost to decide
the most capable and favorable layer order for any input sequence. Experiments
on 3 tasks (neural machine translation, abstractive summarization, and code
generation) and 9 datasets demonstrate consistent improvements from our method.
We further show that the method can also be applied to architectures beyond the
Transformer. Our code is released on GitHub.
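As a rough illustration of the mechanism described in the abstract, the sketch below builds a small PyTorch encoder whose layer order is chosen per input by a lightweight predictor. The class names, mean-pooling, and hard argmax selection are assumptions for illustration, not the authors' released implementation (which trains the predictor jointly with the layers).

```python
import itertools
import torch
import torch.nn as nn

class OrderPredictor(nn.Module):
    """Hypothetical lightweight predictor: pools the input sequence and
    scores each candidate layer permutation."""
    def __init__(self, d_model: int, num_orders: int):
        super().__init__()
        self.proj = nn.Linear(d_model, num_orders)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = x.mean(dim=1)                    # (batch, d_model)
        return self.proj(pooled).softmax(dim=-1)  # (batch, num_orders)

class InstanceOrderedEncoder(nn.Module):
    """Shared stack of Transformer layers applied in an input-dependent order."""
    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.orders = list(itertools.permutations(range(num_layers)))
        self.predictor = OrderPredictor(d_model, len(self.orders))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = self.predictor(x)      # (batch, num_orders)
        choice = probs.argmax(dim=-1)  # hard selection; simplifies training-time soft selection
        outputs = []
        for i in range(x.size(0)):     # per-instance order (illustrative, not batched)
            h = x[i : i + 1]
            for layer_idx in self.orders[choice[i]]:
                h = self.layers[layer_idx](h)
            outputs.append(h)
        return torch.cat(outputs, dim=0)

# toy usage: a batch of 4 sequences, each picking its own layer order
enc = InstanceOrderedEncoder()
out = enc(torch.randn(4, 10, 512))  # (4, 10, 512)
```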
Related papers
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - Exploring vision transformer layer choosing for semantic segmentation [1.2891210250935146]
We propose a neck network for adaptive fusion and feature selection, called ViTController.
We validate the effectiveness of our method on different datasets and models.
Our method can also be used as a plug-in module and inserted into different networks.
arXiv Detail & Related papers (2023-05-02T09:29:12Z) - Jump to Conclusions: Short-Cutting Transformers With Linear Transformations [60.37563766047492]
Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction.
This obscures the internal decision-making process of the model and the utility of its intermediate representations.
We suggest a simple method for casting intermediate hidden representations into final-layer form, using linear transformations (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-03-16T16:10:16Z) - Mitigating Generation Shifts for Generalized Zero-Shot Learning [52.98182124310114]
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize both seen and unseen samples, where unseen classes are not observed during training.
We propose a novel Generation Shifts Mitigating Flow framework for learning unseen data synthesis efficiently and effectively.
Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings.
arXiv Detail & Related papers (2021-07-07T11:43:59Z) - A Reinforcement Learning Approach for Sequential Spatial Transformer Networks [6.585049648605185]
We formulate the task as a Markov Decision Process (MDP) and use RL to solve this sequential decision-making problem.
In our method, we are not bound to the differentiability of the sampling modules.
We design multiple experiments to verify the effectiveness of our method using cluttered MNIST and Fashion-MNIST datasets.
arXiv Detail & Related papers (2021-06-27T17:41:17Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency [31.572652956170252]
Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance.
We experimentally achieve 7.8X parameter reduction, 41.9% training speedup and 37.7% inference speedup while maintaining comparable performance with conventional BERT-like self-supervised methods.
arXiv Detail & Related papers (2021-04-08T08:21:59Z) - Self-Supervised Variational Auto-Encoders [10.482805367361818]
We present a novel class of generative models, called the self-supervised Variational Auto-Encoder (selfVAE).
This class of models allows performing both conditional and unconditional sampling while simplifying the objective function.
We present the performance of our approach on three benchmark image datasets (CIFAR-10, Imagenette64, and CelebA).
arXiv Detail & Related papers (2020-10-05T13:42:28Z) - schuBERT: Optimizing Elements of BERT [22.463154358632472]
We revisit the architecture choices of BERT in efforts to obtain a lighter model.
We show that substantially lighter and more efficient BERT models can be obtained by reducing algorithmically chosen architecture design dimensions.
In particular, our schuBERT gives $6.6\%$ higher average accuracy on GLUE and SQuAD datasets as compared to BERT with three encoder layers.
arXiv Detail & Related papers (2020-05-09T21:56:04Z) - Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of the encoded representation along the position index with a continuous dynamical system.
arXiv Detail & Related papers (2020-03-13T00:41:41Z)
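The "Jump to Conclusions" entry above proposes casting intermediate hidden states into final-layer form with a linear map. Below is a minimal sketch of that idea under stated assumptions: the hidden states are random stand-ins, and the map is fit by ordinary least squares rather than the paper's exact procedure.

```python
import numpy as np

# Stand-in data: paired hidden states that would normally be collected from a
# Transformer on a calibration set. H_mid holds layer-l representations,
# H_final the corresponding final-layer representations.
rng = np.random.default_rng(0)
H_mid = rng.normal(size=(1000, 768))                 # (samples, d_model)
H_final = H_mid @ rng.normal(size=(768, 768)) * 0.1  # synthetic targets

# Fit a linear shortcut A minimizing ||H_mid @ A - H_final||^2.
A, *_ = np.linalg.lstsq(H_mid, H_final, rcond=None)

# At inference, cast a fresh intermediate state directly to final-layer form,
# skipping the remaining layers before decoding.
h_new = rng.normal(size=(1, 768))
h_jumped = h_new @ A
print(h_jumped.shape)  # (1, 768)
```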