Recurrent Memory-Augmented Transformers with Chunked Attention for Long-Context Language Modeling
- URL: http://arxiv.org/abs/2507.00453v1
- Date: Tue, 01 Jul 2025 06:11:38 GMT
- Title: Recurrent Memory-Augmented Transformers with Chunked Attention for Long-Context Language Modeling
- Authors: Ankit Kashyap
- Abstract summary: We present a Transformer architecture for language modeling that combines global attention with two biologically inspired components. This unified attention block allows the model to efficiently handle both short-range and long-range dependencies. The architecture is implemented entirely from scratch in PyTorch, with no reliance on high-level libraries.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a Transformer architecture for long-context language modeling that combines global attention with two biologically inspired components: chunked local attention and a gated FIFO memory mechanism. This unified attention block allows the model to efficiently handle both short-range and long-range dependencies without increasing attention cost quadratically. The memory module persistently stores past token representations using a gated update mechanism inspired by recurrent networks. Rotary positional encoding is applied per attention head to enable directionally disentangled, scale-invariant positional signals. The architecture is implemented entirely from scratch in PyTorch, with no reliance on high-level libraries, enabling transparent and modular experimentation. Our model offers a lightweight and extensible design for tasks such as dialogue modeling, code completion, and document understanding.
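The abstract names the building blocks (chunked local attention, a gated FIFO memory over past representations, per-head rotary encoding) but gives no implementation details. As a rough orientation only, here is a minimal PyTorch sketch of how such a unified attention block might be wired together; every name and hyperparameter (`ChunkedAttentionWithMemory`, `chunk_size`, `mem_slots`, the mean-pooled and batch-averaged chunk summary, the detached recurrent memory) is an assumption for illustration, not the authors' code, and the causal mask inside each chunk is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rotary(x, base=10000.0):
    # x: (B, H, T, D) with D even; rotate feature pairs by position-dependent angles
    B, H, T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(T, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (T, half), broadcast over B and H
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class ChunkedAttentionWithMemory(nn.Module):
    """Toy unified block: each chunk of tokens attends to (a) the tokens of its own
    chunk and (b) a small FIFO memory holding gated summaries of earlier chunks."""

    def __init__(self, d_model=256, n_heads=4, chunk_size=64, mem_slots=32):
        super().__init__()
        self.h, self.dk, self.chunk = n_heads, d_model // n_heads, chunk_size
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)
        # persistent memory slots, shared across the batch in this toy version
        self.register_buffer("memory", torch.zeros(1, mem_slots, d_model))

    def _update_memory(self, summary):
        # FIFO shift; a sigmoid gate blends the incoming summary with the slot it overwrites
        mem = torch.roll(self.memory, shifts=-1, dims=1)
        old = mem[:, -1]
        g = torch.sigmoid(self.gate(torch.cat([old, summary], dim=-1)))
        mem[:, -1] = g * summary + (1.0 - g) * old
        self.memory = mem.detach()  # truncate backprop through the recurrence

    def forward(self, x):
        B, T, D = x.shape  # assumes T is a multiple of chunk_size for brevity
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def heads(t):  # (B, T, D) -> (B, H, T, dk)
            return t.view(B, T, self.h, self.dk).transpose(1, 2)

        q, k, v = heads(q), heads(k), heads(v)
        q, k = apply_rotary(q), apply_rotary(k)      # rotary applied per head
        outputs = []
        for s in range(0, T, self.chunk):
            e = s + self.chunk
            mem = self.memory.expand(B, -1, -1)      # (B, M, D)
            mk = mem.reshape(B, -1, self.h, self.dk).transpose(1, 2)
            kk = torch.cat([mk, k[:, :, s:e]], dim=2)
            vv = torch.cat([mk, v[:, :, s:e]], dim=2)
            # NOTE: causal masking inside the chunk is omitted in this sketch
            outputs.append(F.scaled_dot_product_attention(q[:, :, s:e], kk, vv))
            # mean-pooled, batch-averaged chunk summary pushed into the FIFO memory
            self._update_memory(x[:, s:e].mean(dim=(0, 1)).unsqueeze(0))
        y = torch.cat(outputs, dim=2).transpose(1, 2).reshape(B, T, D)
        return self.out(y)
```

For example, `ChunkedAttentionWithMemory(d_model=256)(torch.randn(2, 256, 256))` yields a `(2, 256, 256)` tensor while each query attends over only `chunk_size + mem_slots` keys; detaching the memory after every update keeps the recurrence from growing the autograd graph, in the spirit of Transformer-XL-style stop-gradient memories.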
Related papers
- Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons [0.0]
We present a novel non-attention-based architecture for large language models (LLMs) that efficiently handles very long context windows. Unlike traditional Transformer designs, which suffer from quadratic memory and compute overhead due to the self-attention mechanism, our model avoids token-to-token attention entirely.
arXiv Detail & Related papers (2025-05-09T00:25:46Z) - Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently via just one model has become increasingly significant. Our approach, termed AnyIR, takes a unified path that leverages the inherent similarity across various degradations. To fuse degradation awareness with contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z) - Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens the core context while diminishing redundancy during the learning process. It can replace the self-attention module in existing Large Language Models at minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - Scalable, Tokenization-Free Diffusion Model Architectures with Efficient Initial Convolution and Fixed-Size Reusable Structures for On-Device Image Generation [0.0]
Vision Transformers and U-Net architectures have been widely adopted in the implementation of Diffusion Models.
We propose an architecture that utilizes a fixed-size, reusable transformer block as a core structure.
Our architecture is characterized by low complexity, token-free design, absence of positional embeddings, uniformity, and scalability.
arXiv Detail & Related papers (2024-11-09T08:58:57Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The computational cost of the self-attention mechanism limits its practicality for long sequences.
We propose a new method called LongVQ that compresses the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps compensate for the lack of long-range dependency modeling.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Cached Transformers: Improving Transformers with Differentiable Memory Cache [71.28188777209034]
This work introduces a new Transformer model called Cached Transformer.
It uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens.
arXiv Detail & Related papers (2023-12-20T03:30:51Z) - Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation [47.7036344302777]
Current video object segmentation (VOS) methods follow an extraction-then-matching pipeline. We propose a unified VOS framework, coined JointFormer, for joint modeling of features, correspondence, and a compressed memory.
arXiv Detail & Related papers (2023-08-25T17:30:08Z) - Landmark Attention: Random-Access Infinite Context Length for Transformers [45.69864961773124]
We present a novel approach that allows access to the complete context while retaining random-access flexibility.
Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks.
We demonstrate that our method obtains performance comparable to Transformer-XL while significantly reducing the number of retrieved tokens in each step (a toy sketch of this landmark-based block selection appears after this list).
arXiv Detail & Related papers (2023-05-25T17:53:42Z) - DAE-Former: Dual Attention-guided Efficient Transformer for Medical Image Segmentation [3.9548535445908928]
We propose DAE-Former, a novel method that seeks to provide an alternative perspective by efficiently designing the self-attention mechanism.
Our method outperforms state-of-the-art methods on multi-organ cardiac and skin lesion segmentation datasets without requiring pre-training weights.
arXiv Detail & Related papers (2022-12-27T14:39:39Z) - LaMemo: Language Modeling with Look-Ahead Memory [50.6248714811912]
We propose Look-Ahead Memory (LaMemo) that enhances the recurrence memory by incrementally attending to the right-side tokens.
LaMemo embraces bi-directional attention and segment recurrence with an additional overhead only linearly proportional to the memory length.
Experiments on widely used language modeling benchmarks demonstrate its superiority over the baselines equipped with different types of memory.
arXiv Detail & Related papers (2022-04-15T06:11:25Z) - Adaptive Semiparametric Language Models [17.53604394786977]
We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component.
Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method (a simplified sketch of combining a parametric LM with an episodic memory appears after this list).
arXiv Detail & Related papers (2021-02-04T11:47:03Z) - Cluster-Former: Clustering-based Sparse Transformer for Long-Range
Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks (a toy sketch of the clustering-based attention pattern appears after this list).
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
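The Landmark Attention entry above describes representing each input block with a landmark token and training attention to select relevant blocks through it. The following toy sketch (referenced from that entry) illustrates only the retrieval pattern: it uses the mean key of each block as a stand-in landmark and an explicit top-k selection, both of which are simplifying assumptions rather than the paper's mechanism.

```python
import torch
import torch.nn.functional as F


def landmark_block_attention(q, k, v, block_size=64, top_k=4):
    """Simplified landmark-style retrieval: each block is summarized by a landmark key
    (here, its mean key); each query picks its top-k blocks and attends only to tokens
    inside those blocks. Shapes: q (B, Tq, D), k/v (B, Tk, D)."""
    B, Tk, D = k.shape
    n_blocks = Tk // block_size  # assumes Tk is a multiple of block_size
    k_blocks = k[:, : n_blocks * block_size].view(B, n_blocks, block_size, D)
    v_blocks = v[:, : n_blocks * block_size].view(B, n_blocks, block_size, D)
    landmarks = k_blocks.mean(dim=2)                         # (B, n_blocks, D)
    scores = torch.einsum("bqd,bnd->bqn", q, landmarks)      # query-to-block relevance
    top = scores.topk(min(top_k, n_blocks), dim=-1).indices  # (B, Tq, top_k)
    out = torch.zeros(B, q.shape[1], D, device=q.device, dtype=q.dtype)
    for b in range(B):
        for t in range(q.shape[1]):
            blocks = top[b, t]                               # indices of selected blocks
            keys = k_blocks[b, blocks].reshape(-1, D)        # (top_k * block_size, D)
            vals = v_blocks[b, blocks].reshape(-1, D)
            att = F.softmax(q[b, t] @ keys.T / D ** 0.5, dim=-1)
            out[b, t] = att @ vals
    return out
```

The nested loops are written for readability only; a practical version would batch the gather and integrate landmark scores into the softmax, and the paper trains dedicated landmark tokens rather than using mean-pooled keys.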
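The Adaptive Semiparametric Language Models entry combines a parametric transformer with a non-parametric episodic memory. As a loose illustration of that general idea (referenced from the entry above), the sketch below interpolates a kNN-style memory distribution with the parametric next-token distribution; the fixed weight `lam`, the L2 retrieval, and the softmax-over-distances kernel are assumptions of this sketch, not the paper's learned gating.

```python
import torch
import torch.nn.functional as F


def knn_augmented_logprobs(hidden, parametric_logits, mem_keys, mem_next_tokens,
                           vocab_size, k=8, lam=0.25, temperature=1.0):
    """Blend a parametric LM distribution with an episodic memory of
    (hidden state -> observed next token) pairs. hidden: (D,), parametric_logits: (V,),
    mem_keys: (N, D) float, mem_next_tokens: (N,) long. Returns log-probs over the vocab."""
    k = min(k, mem_keys.shape[0])
    # retrieve the k nearest stored states by squared L2 distance
    d = ((mem_keys - hidden) ** 2).sum(dim=-1)       # (N,)
    dist, idx = d.topk(k, largest=False)
    # turn distances into a distribution over the retrieved items' next tokens
    w = F.softmax(-dist / temperature, dim=-1)       # (k,)
    p_mem = torch.zeros(vocab_size, device=hidden.device)
    p_mem.index_add_(0, mem_next_tokens[idx], w)
    # interpolate with the parametric next-token distribution
    p_lm = F.softmax(parametric_logits, dim=-1)
    return torch.log(lam * p_mem + (1.0 - lam) * p_lm + 1e-9)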
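The Cluster-Former entry describes restricting attention to chunked and clustered subsets of the sequence. The toy sketch below (referenced from that entry) shows only the clustering-based attention pattern, using plain k-means over hidden states and self-attention within each cluster; the cluster count, the k-means assignment, and the omission of the paper's separate Sliding-Window layer and Q/K projections are all simplifications.

```python
import torch
import torch.nn.functional as F


def cluster_attention(x, n_clusters=4, iters=5):
    """Toy clustering-based sparse attention: assign token states to clusters with a few
    k-means steps, then run full attention only among tokens in the same cluster.
    x: (T, D) with T >= n_clusters. Returns (T, D). For illustration only."""
    T, D = x.shape
    centroids = x[torch.randperm(T)[:n_clusters]].clone()
    for _ in range(iters):                                     # plain k-means on hidden states
        assign = torch.cdist(x, centroids).argmin(dim=-1)      # (T,)
        for c in range(n_clusters):
            members = x[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    out = torch.zeros_like(x)
    for c in range(n_clusters):
        idx = (assign == c).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        xc = x[idx]                                            # (Tc, D)
        att = F.softmax(xc @ xc.T / D ** 0.5, dim=-1)          # attention within the cluster
        out[idx] = att @ xc
    return out
```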
This list is automatically generated from the titles and abstracts of the papers on this site.