Memory-Efficient Differentiable Transformer Architecture Search
- URL: http://arxiv.org/abs/2105.14669v1
- Date: Mon, 31 May 2021 01:52:36 GMT
- Title: Memory-Efficient Differentiable Transformer Architecture Search
- Authors: Yuekai Zhao, Li Dong, Yelong Shen, Zhihua Zhang, Furu Wei, Weizhu Chen
- Abstract summary: We propose a multi-split reversible network and combine it with DARTS.
Specifically, we devise a backpropagation-with-reconstruction algorithm so that we only need to store the last layer's outputs.
We evaluate the searched architecture on three sequence-to-sequence datasets, i.e., WMT'14 English-German, WMT'14 English-French, and WMT'14 English-Czech.
- Score: 59.47253706925725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Differentiable architecture search (DARTS) has been successfully applied to many
vision tasks. However, directly using DARTS for Transformers is
memory-intensive, which renders the search process infeasible. To this end, we
propose a multi-split reversible network and combine it with DARTS.
Specifically, we devise a backpropagation-with-reconstruction algorithm so that
we only need to store the last layer's outputs. By relieving the memory burden
of DARTS, our method makes it possible to search with a larger hidden size and
more candidate operations. We evaluate the searched architecture on three sequence-to-sequence
datasets, i.e., WMT'14 English-German, WMT'14 English-French, and WMT'14
English-Czech. Experimental results show that our network consistently
outperforms standard Transformers across the tasks. Moreover, our method
compares favorably with the big Evolved Transformer, reducing search
computation by an order of magnitude.
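To make the memory argument concrete, here is a minimal sketch of a two-split reversible block with backpropagation-with-reconstruction, written in PyTorch. The module and variable names (ReversibleBlock, the f/g sub-networks, the stand-in upstream gradients) are illustrative assumptions, not the authors' code; the paper's network uses a multi-split generalization of this idea.

```python
# Sketch only: a two-split reversible block whose inputs can be recomputed
# from its outputs, so intermediate activations never need to be stored.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """y1 = x1 + F(x2); y2 = x2 + G(y1): inputs are recoverable from outputs."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.g = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def reconstruct(self, y1, y2):
        # Invert the block: recover the inputs from the outputs.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

blocks = nn.ModuleList([ReversibleBlock(64) for _ in range(4)])
x1, x2 = torch.randn(8, 16, 64), torch.randn(8, 16, 64)

# Forward pass: discard intermediate activations, keep only the final outputs.
with torch.no_grad():
    h1, h2 = x1, x2
    for blk in blocks:
        h1, h2 = blk(h1, h2)

# Backpropagation with reconstruction: walk the stack top-down, rebuild each
# block's inputs, recompute the block under autograd, and chain the gradients.
dy1, dy2 = torch.ones_like(h1), torch.ones_like(h2)   # stand-in upstream grads
for blk in reversed(blocks):
    x1_prev, x2_prev = blk.reconstruct(h1, h2)
    x1_req = x1_prev.detach().requires_grad_()
    x2_req = x2_prev.detach().requires_grad_()
    o1, o2 = blk(x1_req, x2_req)
    torch.autograd.backward([o1, o2], [dy1, dy2])      # accumulates parameter grads
    dy1, dy2 = x1_req.grad, x2_req.grad
    h1, h2 = x1_prev, x2_prev
```

Because each block's inputs are recomputed from its outputs, activation memory stays roughly constant in depth, which is what makes DARTS-style search with a larger hidden size and more candidate operations affordable.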
Related papers
- Separations in the Representational Capabilities of Transformers and Recurrent Architectures [27.783705012503237]
We analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance.
We show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size.
We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass.
arXiv Detail & Related papers (2024-06-13T17:31:30Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) computes self-attention blockwise and fuses it with the feedforward network to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
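As a rough illustration of the blockwise idea (not the authors' implementation), the sketch below chunks only the query dimension so the full attention matrix is never materialized at once; BPT additionally blocks the keys/values and fuses the feedforward computation into the same loop. The shapes and block size are arbitrary assumptions.

```python
# Sketch only: query-chunked attention so peak memory scales with
# block_size * seq_len instead of seq_len**2.
import torch
import torch.nn.functional as F

def blockwise_attention(q, k, v, block_size=128):
    # q, k, v: (batch, seq_len, dim)
    scale = q.shape[-1] ** -0.5
    outputs = []
    for start in range(0, q.shape[1], block_size):
        q_blk = q[:, start:start + block_size]                    # (B, block, D)
        scores = torch.einsum("bqd,bkd->bqk", q_blk, k) * scale   # (B, block, seq_len)
        attn = F.softmax(scores, dim=-1)
        outputs.append(torch.einsum("bqk,bkd->bqd", attn, v))
    return torch.cat(outputs, dim=1)

# Usage with stand-in tensors:
q = k = v = torch.randn(2, 1024, 64)
out = blockwise_attention(q, k, v)   # (2, 1024, 64)
```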
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- Searching the Search Space of Vision Transformer [98.96601221383209]
Vision Transformers have shown great visual representation power across a broad range of vision tasks such as recognition and detection.
We propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space.
We provide design guidelines for general vision Transformers, supported by extensive analysis derived from the space-searching process.
arXiv Detail & Related papers (2021-11-29T17:26:07Z)
- Quality and Cost Trade-offs in Passage Re-ranking Task [0.0]
The paper is devoted to the problem of how to choose the right architecture for the ranking step of the information retrieval pipeline.
We investigated several late-interaction models such as the ColBERT and Poly-encoder architectures, along with their modifications.
We also addressed the memory footprint of the search index by applying a learning-to-hash method to binarize the output vectors of the transformer encoders.
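To illustrate the index-size reduction this targets, here is a plain sign-based binarization with Hamming-distance lookup. This fixed thresholding is only a stand-in for the paper's learned hashing objective, and all names and sizes below are illustrative.

```python
# Sketch only: pack sign bits of dense embeddings and search by Hamming distance.
import numpy as np

def binarize(embeddings):
    # embeddings: (n, d) float32 -> (n, d // 8) uint8 bit-packed codes
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_search(query_code, index_codes, top_k=5):
    # XOR then bit-count gives the Hamming distance to every indexed code.
    distances = np.unpackbits(query_code ^ index_codes, axis=1).sum(axis=1)
    return np.argsort(distances)[:top_k]

# Usage with random stand-in "encoder outputs":
docs = binarize(np.random.randn(10_000, 768).astype(np.float32))
query = binarize(np.random.randn(1, 768).astype(np.float32))
print(hamming_search(query, docs))
```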
arXiv Detail & Related papers (2021-11-18T19:47:45Z)
- Distilling Transformers for Neural Cross-Domain Search [9.865125804658991]
We argue that sequence-to-sequence models are a conceptually ideal, albeit highly impractical, retriever.
We derive a new distillation objective, implementing it as a data augmentation scheme.
Using natural language source code search as a case study for cross-domain search, we demonstrate the validity of this idea by significantly improving upon the current leader of the CodeSearchNet challenge, a recent natural language code search benchmark.
arXiv Detail & Related papers (2021-08-06T22:30:19Z)
- GLiT: Neural Architecture Search for Global and Local Image Transformer [114.8051035856023]
We introduce the first Neural Architecture Search (NAS) method to find a better transformer architecture for image recognition.
Our method can find more discriminative and efficient transformer variants than the ResNet family and the baseline ViT for image classification.
arXiv Detail & Related papers (2021-07-07T00:48:09Z)
- Vision Transformer Architecture Search [64.73920718915282]
Current vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks.
We propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets.
Our searched architecture achieves 74.7% top-1 accuracy on ImageNet, 2.5% higher than the current baseline ViT architecture.
arXiv Detail & Related papers (2021-06-25T15:39:08Z)
- AutoTrans: Automating Transformer Design via Reinforced Architecture Search [52.48985245743108]
This paper empirically explores how to set layer normalization, whether to scale, the number of layers, the number of heads, the activation function, etc., so that one can obtain a Transformer architecture that better suits the task at hand.
Experiments on CoNLL03, Multi-30k, IWSLT14, and WMT-14 show that the searched Transformer model can outperform standard Transformers.
arXiv Detail & Related papers (2020-09-04T08:46:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.