Trainable Transformer in Transformer
- URL: http://arxiv.org/abs/2307.01189v2
- Date: Thu, 8 Feb 2024 16:19:14 GMT
- Title: Trainable Transformer in Transformer
- Authors: Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, Sanjeev Arora
- Abstract summary: We propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference.
TinT accommodates many common transformer variants and its design ideas also improve the efficiency of past instantiations of simple models inside transformers.
These findings suggest that large pre-trained language models are capable of performing intricate subroutines.
- Score: 48.754918968374334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works attribute the capability of in-context learning (ICL) in large
pre-trained language models to implicitly simulating and fine-tuning an
internal model (e.g., linear or 2-layer MLP) during inference. However, such
constructions require large memory overhead, which makes simulation of more
sophisticated internal models intractable. In this work, we propose an
efficient construction, Transformer in Transformer (in short, TinT), that
allows a transformer to simulate and fine-tune complex models internally during
inference (e.g., pre-trained language models). In particular, we introduce
innovative approximation techniques that allow a TinT model with less than 2
billion parameters to simulate and fine-tune a 125 million parameter
transformer model within a single forward pass. TinT accommodates many common
transformer variants and its design ideas also improve the efficiency of past
instantiations of simple models inside transformers. We conduct end-to-end
experiments to validate the internal fine-tuning procedure of TinT on various
language modeling and downstream tasks. For example, even with a limited
one-step budget, we observe that TinT for an OPT-125M model improves performance by
4-16% absolute on average compared to OPT-125M. These findings suggest that
large pre-trained language models are capable of performing intricate
subroutines. To facilitate further work, a modular and extensible codebase for
TinT is included.
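To make the internal fine-tuning procedure concrete, below is a minimal, purely illustrative PyTorch sketch (not taken from the TinT codebase) of the behavior TinT is constructed to simulate: a limited number of gradient steps on an auxiliary model using the in-context demonstrations, followed by prediction on the query, all within a single call. The names (InternalFineTuner, aux_model, demo_ids, query_ids) are assumptions; TinT itself realizes this dynamic with attention and MLP modules plus approximation techniques rather than explicit autograd.

```python
# Conceptual sketch only: what TinT's single forward pass behaves like.
# All names here are illustrative assumptions, not the TinT API.
import copy
import torch
import torch.nn.functional as F


class InternalFineTuner(torch.nn.Module):
    def __init__(self, aux_model: torch.nn.Module, lr: float = 1e-2, steps: int = 1):
        super().__init__()
        self.aux_model = aux_model   # the model being "simulated" (e.g., a small LM)
        self.lr = lr                 # internal learning rate
        self.steps = steps           # limited step budget (even one step helps, per the paper)

    def forward(self, demo_ids, demo_targets, query_ids):
        # Fine-tune a throwaway copy so the outer parameters stay untouched,
        # mirroring how only the simulated internal weights are updated.
        tuned = copy.deepcopy(self.aux_model)
        opt = torch.optim.SGD(tuned.parameters(), lr=self.lr)
        for _ in range(self.steps):
            logits = tuned(demo_ids)                               # (B, T, V)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   demo_targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()                                             # one internal update
        with torch.no_grad():
            return tuned(query_ids)   # predictions from the internally fine-tuned model
```

The point of the construction is that this loop, which normally needs an explicit optimizer, can be emulated entirely inside the forward pass of a simulator with fewer than 2 billion parameters wrapped around a 125-million-parameter auxiliary transformer.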
Related papers
- Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in deployment and prohibitive costs of training from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of the fully connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to the original model.
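As a hedged illustration of the technique described above, the sketch below stores a fully connected layer's weight as two small tensor-train cores instead of one dense matrix; the two-core layout, rank, and initialization are assumptions, and the paper's TTM layers may use more cores and a different contraction order.

```python
# Illustrative two-core TT-matrix linear layer (assumed layout, not the paper's exact code).
import torch


class TTLinear(torch.nn.Module):
    def __init__(self, n1, n2, m1, m2, rank):
        super().__init__()
        # Full weight W has shape (m1*m2, n1*n2) but is stored as two small cores,
        # costing m1*n1*rank + rank*m2*n2 parameters instead of m1*m2*n1*n2.
        self.g1 = torch.nn.Parameter(torch.randn(m1, n1, rank) * 0.02)
        self.g2 = torch.nn.Parameter(torch.randn(rank, m2, n2) * 0.02)
        self.bias = torch.nn.Parameter(torch.zeros(m1 * m2))
        self.dims = (m1, m2, n1, n2)

    def forward(self, x):                                # x: (..., n1*n2)
        m1, m2, n1, n2 = self.dims
        # Reassemble W[(i1,i2),(j1,j2)] = sum_r g1[i1,j1,r] * g2[r,i2,j2]
        w = torch.einsum("ijr,rkl->ikjl", self.g1, self.g2).reshape(m1 * m2, n1 * n2)
        return x @ w.t() + self.bias


# Example: replace a 768 -> 3072 fully connected layer (hypothetical factorization).
layer = TTLinear(n1=32, n2=24, m1=48, m2=64, rank=16)
y = layer(torch.randn(4, 32 * 24))                       # y: (4, 3072)
```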
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
The Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
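The sketch below illustrates only the general idea of shortening a hidden-state sequence by discarding its highest-frequency components with an FFT over the time axis; the Fourier Transformer's exact operator, placement in the encoder, and normalization may differ, and spectral_downsample / keep_ratio are illustrative names.

```python
# Rough illustration of spectral sequence downsampling; not necessarily the
# operator used in the Fourier Transformer paper.
import torch


def spectral_downsample(hidden: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Shorten (B, T, D) hidden states along the sequence axis by truncating
    high-frequency components of an FFT taken over the time dimension."""
    B, T, D = hidden.shape
    new_len = max(2, int(T * keep_ratio))
    freq = torch.fft.rfft(hidden, dim=1)             # (B, T//2 + 1, D)
    freq = freq[:, : new_len // 2 + 1, :]            # drop the highest frequencies
    out = torch.fft.irfft(freq, n=new_len, dim=1)    # back to a shorter sequence
    return out * (new_len / T)                       # rescale for the length change


x = torch.randn(2, 512, 768)
print(spectral_downsample(x, keep_ratio=0.25).shape)  # torch.Size([2, 128, 768])
```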
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
- Leveraging Pre-trained Models for Failure Analysis Triplets Generation [0.0]
We leverage the attention mechanism of pre-trained causal (Transformer) language models for the downstream task of generating Failure Analysis Triplets (FATs).
We observe that Generative Pre-trained Transformer 2 (GPT2) outperformed other transformer models on the failure analysis triplet generation (FATG) task.
In particular, we observe that GPT2 (with 1.5B parameters) outperforms pre-trained BERT, BART and GPT3 by a large margin on ROUGE.
arXiv Detail & Related papers (2022-10-31T17:21:15Z)
- DeepNet: Scaling Transformers to 1,000 Layers [106.33669415337135]
We introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer.
In-depth theoretical analysis shows that model updates can be bounded in a stable way.
We successfully scale Transformers up to 1,000 layers without difficulty, which is one order of magnitude deeper than previous deep Transformers.
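For reference, the DeepNorm idea amounts to up-weighting the residual branch and down-scaling sublayer initialization, i.e. x_{l+1} = LayerNorm(alpha * x_l + sublayer(x_l)). The sketch below follows that formula; the specific alpha/beta values are depth- and architecture-dependent, and the encoder-only values used here are from memory and should be checked against the paper.

```python
# Sketch of a DeepNorm-style residual block: x <- LayerNorm(alpha * x + sublayer(x)),
# with sublayer weights down-scaled by beta at initialization.
# The alpha/beta values below assume the encoder-only setting (recalled, unverified).
import torch


class DeepNormBlock(torch.nn.Module):
    def __init__(self, sublayer: torch.nn.Module, d_model: int, num_layers: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = torch.nn.LayerNorm(d_model)
        self.alpha = (2 * num_layers) ** 0.25        # up-weights the residual branch
        beta = (8 * num_layers) ** -0.25             # shrinks sublayer initialization
        for p in self.sublayer.parameters():
            if p.dim() > 1:
                torch.nn.init.xavier_normal_(p, gain=beta)

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))
```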
arXiv Detail & Related papers (2022-03-01T15:36:38Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but this also leads to huge computation costs.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
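A rough sketch of the MoEfication idea, under simplifying assumptions: the pretrained FFN's hidden neurons are partitioned into contiguous expert groups and a router picks the top-k groups per token. The paper's neuron clustering and router construction are more involved, and the masking here only simulates the sparse computation (a real implementation would skip the unselected slices).

```python
# Illustrative MoEfication-style FFN; names and the contiguous neuron split are assumptions.
import torch
import torch.nn.functional as F


class MoEfiedFFN(torch.nn.Module):
    def __init__(self, w1: torch.nn.Linear, w2: torch.nn.Linear,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_model, d_ff = w1.in_features, w1.out_features
        assert d_ff % num_experts == 0
        self.w1, self.w2 = w1, w2                            # pretrained FFN weights
        self.num_experts, self.top_k = num_experts, top_k
        self.chunk = d_ff // num_experts                     # neurons per expert group
        self.router = torch.nn.Linear(d_model, num_experts)  # trained after the split

    def forward(self, x):                                    # x: (B, T, d_model)
        scores = self.router(x)                              # (B, T, E)
        top = scores.topk(self.top_k, dim=-1).indices        # chosen expert groups
        mask = torch.zeros_like(scores).scatter_(-1, top, 1.0)        # (B, T, E)
        mask = mask.unsqueeze(-1).expand(-1, -1, -1, self.chunk)      # (B, T, E, chunk)
        h = F.relu(self.w1(x))                               # (B, T, d_ff); ReLU activation
        h = h * mask.reshape(h.shape)                        # zero out unselected experts
        return self.w2(h)                                    # (B, T, d_model)
```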
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size [41.624797099537375]
We present a novel method for adding recurrence to pretrained transformer language models, improving efficiency and usable context size.
We find that our method attains better perplexity than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora.
arXiv Detail & Related papers (2020-08-16T23:19:30Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
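As an illustration of the distillation objective described above, the sketch below matches the student's last-layer attention distributions to the teacher's via KL divergence; MiniLM additionally transfers value-value relations, which is omitted here, and the assumption that teacher and student use the same number of heads is mine.

```python
# Minimal attention-distribution distillation loss in the spirit of MiniLM
# (value-relation transfer omitted; same head count assumed).
import torch


def attention_distill_loss(teacher_attn, student_attn, eps: float = 1e-12):
    """teacher_attn, student_attn: (B, H, T, T) softmax attention probabilities
    from the last Transformer layer of teacher and student."""
    kl = teacher_attn * (torch.log(teacher_attn + eps) - torch.log(student_attn + eps))
    return kl.sum(dim=-1).mean()   # KL per query position, averaged over batch/heads/positions


t = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)   # teacher attention (toy values)
s = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)   # student attention (toy values)
print(attention_distill_loss(t, s))
```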
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.