Intra-Layer Recurrence in Transformers for Language Modeling
- URL: http://arxiv.org/abs/2505.01855v2
- Date: Fri, 23 May 2025 19:19:41 GMT
- Title: Intra-Layer Recurrence in Transformers for Language Modeling
- Authors: Anthony Nguyen, Wenjun Lin
- Abstract summary: Intra-Layer Recurrence (ILR) is a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results.
- Score: 0.03320194947871346
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.
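As a rough illustration of the idea (not the authors' exact implementation), the sketch below applies each block a configurable number of times within one forward pass, front-loading the reuse counts toward earlier layers; the layer class, dimensions, and counts are illustrative assumptions, and a non-causal encoder layer stands in for a language-model decoder block for brevity.
```python
# Minimal sketch of Intra-Layer Recurrence: block i is applied reuse[i]
# times within a single forward pass. The reuse map (3, 2, 1, 1) is
# illustrative, not the paper's reported configuration.
import torch
import torch.nn as nn

class ILREncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, reuse=(3, 2, 1, 1)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in reuse
        )
        self.reuse = reuse

    def forward(self, x):
        for layer, n_iter in zip(self.layers, self.reuse):
            for _ in range(n_iter):  # recurrence confined to this one layer
                x = layer(x)
        return x

model = ILREncoder()
out = model(torch.randn(2, 16, 512))  # (batch, seq_len, d_model)
```
Allocating more iterations to the earlier blocks, as in the reuse map above, mirrors the paper's finding that front-loading recurrence works best.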
Related papers
- RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals [2.287772422489548]
We propose RingFormer, which employs one Transformer layer that processes input repeatedly in a circular, ring-like manner. This allows us to reduce the model parameters substantially while maintaining high performance in a variety of tasks such as translation and image classification.
arXiv Detail & Related papers (2025-02-18T09:34:31Z)
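A hedged sketch of the weight-sharing idea behind RingFormer: a single shared layer is applied for a fixed number of rounds, with a learned per-round embedding standing in for the adaptive level signals (the paper's actual signal design and injection point may differ).
```python
# One shared Transformer layer applied in repeated, ring-like rounds.
# The learned per-round embedding is only a stand-in for RingFormer's
# adaptive level signals; the paper's exact mechanism may differ.
import torch
import torch.nn as nn

class SharedLayerStack(nn.Module):
    def __init__(self, d_model=512, nhead=8, n_rounds=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.level_emb = nn.Embedding(n_rounds, d_model)  # one signal per round
        self.n_rounds = n_rounds

    def forward(self, x):
        for r in range(self.n_rounds):
            # Inject the round-specific signal so the shared weights can
            # still condition on effective depth.
            x = self.layer(x + self.level_emb.weight[r])
        return x

y = SharedLayerStack()(torch.randn(2, 16, 512))
```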
- Investigating Recurrent Transformers with Dynamic Halt [64.862738244735]
We study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism. We propose and investigate novel ways to extend and combine the methods.
arXiv Detail & Related papers (2024-02-01T19:47:31Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- Causal Transformers Perform Below Chance on Recursive Nested Constructions, Unlike Humans [7.897143833642971]
We test four different Transformer LMs on two different types of nested constructions.
We find that Transformers achieve near-perfect performance on short-range embedded dependencies.
On long-range embedded dependencies, Transformers' performance sharply drops below chance level.
arXiv Detail & Related papers (2021-10-14T09:22:17Z)
- Leveraging redundancy in attention with Reuse Transformers [58.614198953733194]
Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way.
A typical Transformer model computes such pairwise attention scores repeatedly for the same sequence.
We propose a novel architecture that reuses attention scores computed in one layer in multiple subsequent layers.
arXiv Detail & Related papers (2021-10-13T16:08:02Z)
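A simplified, single-head sketch of the score-reuse idea: attention weights are computed once and then applied with fresh value projections in every subsequent layer. The real model is multi-head, keeps feed-forward sublayers, and reuses scores only in selected layers, so treat this as an illustration rather than the paper's architecture.
```python
# Attention-score reuse: the softmax attention map is computed once and
# cached, then every layer applies it to its own value projection.
# Single-head and without feed-forward sublayers for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReuseAttentionStack(nn.Module):
    def __init__(self, d_model=512, n_layers=4):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        # Each layer still has its own value and output projections.
        self.v = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.out = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.scale = d_model ** -0.5

    def forward(self, x):
        # Pairwise scores computed once, in the first layer.
        attn = F.softmax(self.q(x) @ self.k(x).transpose(-2, -1) * self.scale, dim=-1)
        for v_proj, out_proj in zip(self.v, self.out):
            x = x + out_proj(attn @ v_proj(x))  # cached scores reused here
        return x

y = ReuseAttentionStack()(torch.randn(2, 16, 512))
```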
- IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond the Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z)
- Deriving Differential Target Propagation from Iterating Approximate Inverses [91.3755431537592]
We show that a particular form of target propagation, which is differential and relies on learned inverses of each layer, gives rise to an update rule corresponding to an approximate Gauss-Newton gradient-based optimization.
We consider several iterative calculations based on local auto-encoders at each layer in order to achieve more precise inversions for more accurate target propagation.
arXiv Detail & Related papers (2020-07-29T22:34:45Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Learned Multi-layer Residual Sparsifying Transform Model for Low-dose CT Reconstruction [11.470070927586017]
Sparsifying transform learning involves highly efficient sparse coding and operator update steps.
We propose a Multi-layer Residual Sparsifying Transform (MRST) learning model wherein the transform domain residuals are jointly sparsified over layers.
arXiv Detail & Related papers (2020-05-08T02:36:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.