Related papers: Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows

Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows

URL: http://arxiv.org/abs/2510.22109v1
Date: Sat, 25 Oct 2025 01:29:37 GMT
Title: Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows
Authors: Billy Dickson, Zoran Tiganj,
Abstract summary: We introduce an approach that modifies the input representation itself, rather than the transformer architecture.<n>Inspired by cognitive models of human memory, our method applies a scale-invariant logarithmic compression to the input tokens.
Score: 3.437656066916039
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Most approaches to long-context processing increase the complexity of the transformer's internal architecture by integrating mechanisms such as recurrence or auxiliary memory modules. In this work, we introduce an alternative approach that modifies the input representation itself, rather than the transformer architecture. Inspired by cognitive models of human memory, our method applies a scale-invariant logarithmic compression to the input tokens. The resulting compressed representation is processed by a standard, unmodified transformer, preserving architectural simplicity. We evaluate this approach on the WikiText-103 and PG-19 language modeling benchmarks, showing a reduction in perplexity compared to uncompressed baselines. Moreover, performance improves consistently with longer compressed temporal contexts, showing that input-level logarithmic compression is a simple and effective way to extend a transformer's long-range memory.

Related papers

ScaleFormer: Span Representation Cumulation for Long-Context Transformer [9.845891949404534]
We propose a plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences.<n>Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder.<n> Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2025-11-13T07:05:45Z)
Transformers from Compressed Representations [74.48571451824569]
TEMPEST (TransformErs froM comPressed rEpreSenTations) is a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy.<n>Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage.
arXiv Detail & Related papers (2025-10-26T13:48:03Z)
Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning [16.35681450323654]
Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute.<n>We give a theoretical justification as to why memory (re)consolidation via KV cache rewrites is beneficial for improved reasoning.<n>Our model sees consistent performance gains over vanilla Transformers and pause-token augmented baselines, with gains of up to +6.6pp for selected tasks/backbones.
arXiv Detail & Related papers (2025-05-22T17:33:49Z)
Enhanced Computationally Efficient Long LoRA Inspired Perceiver Architectures for Auto-Regressive Language Modeling [2.9228447484533695]
The Transformer architecture has revolutionized the Natural Language Processing field and is the backbone of Large Language Models (LLMs)<n>One of the challenges in the Transformer architecture is the quadratic complexity of the attention mechanism that prohibits the efficient processing of long sequence lengths.<n>One of the important works in this respect is the Perceiver class of architectures that have demonstrated excellent performance while reducing the computation complexity.
arXiv Detail & Related papers (2024-12-08T23:41:38Z)
MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations. Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality. No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
A Survey on Transformer Compression [84.18094368700379]
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV) Model compression methods reduce the memory and computational cost of Transformer. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is? [27.58916930770997]
We show a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets.
arXiv Detail & Related papers (2023-11-22T02:23:32Z)
Context Compression for Auto-regressive Transformers with Sentinel Tokens [37.07722536907739]
We propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach.
arXiv Detail & Related papers (2023-10-12T09:18:19Z)
Resource-Efficient Separation Transformer [14.666016177212837]
This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture. The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings.
arXiv Detail & Related papers (2022-06-19T23:37:24Z)
Error Correction Code Transformer [92.10654749898927]
We propose to extend for the first time the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths. We encode each channel's output dimension to high dimension for better representation of the bits information to be processed separately. The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
arXiv Detail & Related papers (2022-03-27T15:25:58Z)
CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning. The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery. The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
Text Compression-aided Transformer Encoding [77.16960983003271]
We propose explicit and implicit text compression approaches to enhance the Transformer encoding. backbone information, meaning the gist of the input text, is not specifically focused on. Our evaluation on benchmark datasets shows that the proposed explicit and implicit text compression approaches improve results in comparison to strong baselines.
arXiv Detail & Related papers (2021-02-11T11:28:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.