Freely Long-Thinking Transformer (FraiLT)
- URL: http://arxiv.org/abs/2401.11626v2
- Date: Sat, 24 Feb 2024 12:11:56 GMT
- Title: Freely Long-Thinking Transformer (FraiLT)
- Authors: Akbay Tabak
- Abstract summary: Freely Long-Thinking Transformer (FraiLT) is an improved transformer model designed to enhance processing capabilities without scaling up size.
FraiLT utilizes a recursive approach, iterating over a subset of layers multiple times, and introduces iteration encodings to maintain awareness across these cycles.
When evaluated on a synthetic story dataset, FraiLT outperformed larger models, showcasing its ability to deliver high-quality performance while reducing memory demands.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Freely Long-Thinking Transformer (FraiLT) is an improved transformer model
designed to enhance processing capabilities without scaling up size. It
utilizes a recursive approach, iterating over a subset of layers multiple
times, and introduces iteration encodings to maintain awareness across these
cycles. Iteration encoding allows FraiLT to achieve the interpretive depth of
larger models in a compact form. When evaluated on a synthetic story dataset,
FraiLT outperformed larger models, showcasing its ability to deliver
high-quality performance while reducing memory demands. This model represents a
step forward towards more efficient and accessible language models.
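The abstract gives enough to illustrate the core idea in code: reuse a small subset of layers for several passes and tell those layers which pass they are on via an iteration encoding. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; the use of a standard `nn.TransformerEncoderLayer` as the shared block and the choice to add a learned per-iteration embedding to the hidden states before each pass are assumptions made here for illustration.

```python
# Minimal sketch of a FraiLT-style recursive pass (illustrative, not the paper's code).
# Assumptions: a standard encoder layer stands in for the model's block, and the
# iteration encoding is a learned embedding added to the hidden states per pass.
import torch
import torch.nn as nn


class RecursiveEncoder(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_layers: int, n_iters: int):
        super().__init__()
        # Small subset of layers that will be reused across iterations.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One learned vector per iteration, so the shared layers know
        # which pass of the recursion they are currently in.
        self.iter_embedding = nn.Embedding(n_iters, d_model)
        self.n_iters = n_iters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Iterate over the same layers n_iters times instead of stacking
        # n_layers * n_iters distinct layers.
        for it in range(self.n_iters):
            x = x + self.iter_embedding.weight[it]  # iteration encoding
            for block in self.blocks:
                x = block(x)
        return x


# Usage: 4 shared layers iterated 3 times give 12 layer applications in depth
# while storing only 4 layers' worth of parameters.
model = RecursiveEncoder(d_model=256, n_heads=4, n_layers=4, n_iters=3)
tokens = torch.randn(2, 16, 256)  # (batch, sequence, d_model)
out = model(tokens)
print(out.shape)  # torch.Size([2, 16, 256])
```

The memory saving in this sketch comes entirely from parameter sharing: effective depth grows with the number of iterations, while the parameter count stays fixed at the size of the shared subset plus the small iteration-embedding table.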
Related papers
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z) - READ: Recurrent Adaptation of Large Transformers [7.982905666062059]
Fine-tuning large-scale Transformers becomes impractical as the model size and number of tasks increase.
We introduce REcurrent ADaption (READ), a lightweight and memory-efficient fine-tuning method.
arXiv Detail & Related papers (2023-05-24T16:59:41Z) - MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers [78.85346970193518]
Megabyte is a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes.
Experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling.
Results establish the viability of tokenization-free autoregressive sequence modeling at scale.
arXiv Detail & Related papers (2023-05-12T00:55:41Z) - I3D: Transformer architectures with input-dependent dynamic depth for
speech recognition [41.35563331283372]
We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present interesting analysis on the gate probabilities and the input-dependency, which helps us better understand deep encoders.
arXiv Detail & Related papers (2023-03-14T04:47:00Z) - Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning a linear map from the parameters of the smaller model that initializes the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z) - Sliced Recursive Transformer [23.899076070924153]
Recursive operation on vision transformers can improve parameter utilization without involving additional parameters.
Our model Sliced Recursive Transformer (SReT) is compatible with a broad range of other designs for efficient vision transformers.
arXiv Detail & Related papers (2021-11-09T17:59:14Z) - Hierarchical Transformers Are More Efficient Language Models [19.061388006885686]
Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences, which allows them to produce long, coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
arXiv Detail & Related papers (2021-10-26T14:00:49Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z) - Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z) - Funnel-Transformer: Filtering out Sequential Redundancy for Efficient
Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z) - Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.