IOT: Instance-wise Layer Reordering for Transformer Structures
- URL: http://arxiv.org/abs/2103.03457v1
- Date: Fri, 5 Mar 2021 03:44:42 GMT
- Title: IOT: Instance-wise Layer Reordering for Transformer Structures
- Authors: Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou,
Houqiang Li, Tie-Yan Liu
- Abstract summary: We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
- Score: 173.39918590438245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With sequentially stacked self-attention, (optional) encoder-decoder
attention, and feed-forward layers, the Transformer has achieved great success in
natural language processing (NLP), and many variants have been proposed. Currently,
almost all these models assume that the layer order is fixed and kept the same
across data samples. We observe that different data samples actually favor
different orders of the layers. Based on this observation, in this work, we
break the assumption of the fixed layer order in the Transformer and introduce
instance-wise layer reordering into the model structure. Our Instance-wise
Ordered Transformer (IOT) can model different functions through reordered layers,
enabling each sample to select the order that yields the best performance while
keeping the number of parameters almost unchanged. To achieve this, we introduce
a lightweight predictor with negligible parameter and inference cost to decide
the most capable and favorable layer order for any input sequence. Experiments
on 3 tasks (neural machine translation, abstractive summarization, and code
generation) and 9 datasets demonstrate consistent improvements from our method.
We further show that the method can also be applied to architectures beyond the
Transformer. Our code is released on GitHub.
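As a rough illustration of the mechanism described in the abstract, the sketch below builds a small PyTorch encoder whose layer order is chosen per input by a lightweight predictor. The class names, mean-pooling, and hard argmax selection are assumptions for illustration, not the authors' released implementation (which trains the predictor jointly with the layers).

```python
import itertools
import torch
import torch.nn as nn

class OrderPredictor(nn.Module):
    """Hypothetical lightweight predictor: pools the input sequence and
    scores each candidate layer permutation."""
    def __init__(self, d_model: int, num_orders: int):
        super().__init__()
        self.proj = nn.Linear(d_model, num_orders)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = x.mean(dim=1)                    # (batch, d_model)
        return self.proj(pooled).softmax(dim=-1)  # (batch, num_orders)

class InstanceOrderedEncoder(nn.Module):
    """Shared stack of Transformer layers applied in an input-dependent order."""
    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.orders = list(itertools.permutations(range(num_layers)))
        self.predictor = OrderPredictor(d_model, len(self.orders))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = self.predictor(x)      # (batch, num_orders)
        choice = probs.argmax(dim=-1)  # hard selection; simplifies training-time soft selection
        outputs = []
        for i in range(x.size(0)):     # per-instance order (illustrative, not batched)
            h = x[i : i + 1]
            for layer_idx in self.orders[choice[i]]:
                h = self.layers[layer_idx](h)
            outputs.append(h)
        return torch.cat(outputs, dim=0)

# toy usage: a batch of 4 sequences, each picking its own layer order
enc = InstanceOrderedEncoder()
out = enc(torch.randn(4, 10, 512))  # (4, 10, 512)
```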
Related papers
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - Exploring vision transformer layer choosing for semantic segmentation [1.2891210250935146]
We propose a neck network for adaptive fusion and feature selection, called ViTController.
We validate the effectiveness of our method on different datasets and models.
Our method can also be used as a plug-in module and inserted into different networks.
arXiv Detail & Related papers (2023-05-02T09:29:12Z) - Jump to Conclusions: Short-Cutting Transformers With Linear Transformations [60.37563766047492]
Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction.
This obscures the internal decision-making process of the model and the utility of its intermediate representations.
We suggest a simple method for casting intermediate hidden representations into final-layer form, using linear transformations (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-03-16T16:10:16Z) - Mitigating Generation Shifts for Generalized Zero-Shot Learning [52.98182124310114]
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize both seen and unseen samples, where unseen classes are not observed during training.
We propose a novel Generation Shifts Mitigating Flow framework for learning unseen data synthesis efficiently and effectively.
Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings.
arXiv Detail & Related papers (2021-07-07T11:43:59Z) - A Reinforcement Learning Approach for Sequential Spatial Transformer Networks [6.585049648605185]
We formulate the task as a Markov Decision Process (MDP) and use RL to solve this sequential decision-making problem.
In our method, we are not bound to the differentiability of the sampling modules.
We design multiple experiments to verify the effectiveness of our method using cluttered MNIST and Fashion-MNIST datasets.
arXiv Detail & Related papers (2021-06-27T17:41:17Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency [31.572652956170252]
Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance.
We experimentally achieve 7.8X parameter reduction, 41.9% training speedup and 37.7% inference speedup while maintaining comparable performance with conventional BERT-like self-supervised methods.
arXiv Detail & Related papers (2021-04-08T08:21:59Z) - Self-Supervised Variational Auto-Encoders [10.482805367361818]
We present a novel class of generative models, called the self-supervised Variational Auto-Encoder (selfVAE).
This class of models allows performing both conditional and unconditional sampling while simplifying the objective function.
We present the performance of our approach on three benchmark image datasets (CIFAR-10, Imagenette64, and CelebA).
arXiv Detail & Related papers (2020-10-05T13:42:28Z) - schuBERT: Optimizing Elements of BERT [22.463154358632472]
We revisit the architecture choices of BERT in efforts to obtain a lighter model.
We show that substantially lighter and more efficient BERT models can be obtained by reducing algorithmically chosen architecture design dimensions.
In particular, our schuBERT gives $6.6\%$ higher average accuracy on GLUE and SQuAD datasets as compared to BERT with three encoder layers.
arXiv Detail & Related papers (2020-05-09T21:56:04Z) - Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of the encoded representation along the position index with a continuous dynamical system.
arXiv Detail & Related papers (2020-03-13T00:41:41Z)
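The "Jump to Conclusions" entry above proposes casting intermediate hidden states into final-layer form with a linear map. Below is a minimal sketch of that idea under stated assumptions: the hidden states are random stand-ins, and the map is fit by ordinary least squares rather than the paper's exact procedure.

```python
import numpy as np

# Stand-in data: paired hidden states that would normally be collected from a
# Transformer on a calibration set. H_mid holds layer-l representations,
# H_final the corresponding final-layer representations.
rng = np.random.default_rng(0)
H_mid = rng.normal(size=(1000, 768))                 # (samples, d_model)
H_final = H_mid @ rng.normal(size=(768, 768)) * 0.1  # synthetic targets

# Fit a linear shortcut A minimizing ||H_mid @ A - H_final||^2.
A, *_ = np.linalg.lstsq(H_mid, H_final, rcond=None)

# At inference, cast a fresh intermediate state directly to final-layer form,
# skipping the remaining layers before decoding.
h_new = rng.normal(size=(1, 768))
h_jumped = h_new @ A
print(h_jumped.shape)  # (1, 768)
```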