EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices
- URL: http://arxiv.org/abs/2503.22196v1
- Date: Fri, 28 Mar 2025 07:26:37 GMT
- Title: EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices
- Authors: Jiyu Chen, Shuang Peng, Daxiong Luo, Fan Yang, Renshou Wu, Fangyuan Li, Xiaoxin Chen
- Abstract summary: Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs.
- Score: 3.739419555718102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and the growing memory demands of the Key-Value (KV) cache. Existing KV cache optimizations struggle with irreversible token eviction in long-output tasks, while alternative sequence modeling architectures prove costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach maintains full compatibility with standard Transformer architectures, requires fine-tuning only a small fraction of the parameters, and enables selective activation of the memory-gating module for routing between long- and short-context tasks. Experimental results show that EdgeInfinite achieves performance comparable to the baseline Transformer-based LLM on long-context benchmarks while improving memory consumption and time to first token.
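To make the gating idea concrete, here is a minimal, hypothetical PyTorch sketch of a memory-gating attention layer: a learned sigmoid gate blends local attention over the recent window with a read from a fixed-size memory. The class name GatedMemoryAttention, the mem_slots parameter, and the use of learned memory slots (rather than an online compression of evicted KV states) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMemoryAttention(nn.Module):
    """Hypothetical sketch in the spirit of EdgeInfinite: local attention over the
    recent KV window is blended with a read from a fixed-size compressed memory
    via a trainable sigmoid gate."""

    def __init__(self, d_model: int, mem_slots: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Fixed-size compressed memory: mem_slots "summary" key/value pairs.
        self.mem_k = nn.Parameter(torch.randn(mem_slots, d_model) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(mem_slots, d_model) * 0.02)
        # Trainable gate deciding how much of the memory read to mix in.
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window, d_model) -- only the recent local window is kept.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scale = q.shape[-1] ** -0.5
        local = F.scaled_dot_product_attention(q, k, v)   # local attention (causal mask omitted)
        mem_scores = (q @ self.mem_k.T) * scale           # read the compressed memory
        mem_out = mem_scores.softmax(dim=-1) @ self.mem_v
        g = torch.sigmoid(self.gate(x))                   # (batch, window, 1)
        return g * mem_out + (1.0 - g) * local

# Usage: only the gate/memory parameters would be fine-tuned; the backbone stays frozen.
x = torch.randn(2, 128, 256)
print(GatedMemoryAttention(256)(x).shape)  # torch.Size([2, 128, 256])
```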
Related papers
- EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices [3.5487823143282657]
This paper introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.
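The exact S-SFT recipe is not reproduced on this page; as a loose illustration only, the hypothetical helper below slices a long training example into overlapping segments that fit an edge-sized context window (segment_len and overlap are made-up parameters).

```python
def segment_for_sft(token_ids, segment_len=2048, overlap=256):
    """Hypothetical helper: split one long token sequence into overlapping
    fixed-length segments so fine-tuning batches fit an edge-sized context."""
    assert 0 <= overlap < segment_len
    segments, start = [], 0
    while start < len(token_ids):
        segments.append(token_ids[start:start + segment_len])
        if start + segment_len >= len(token_ids):
            break
        start += segment_len - overlap
    return segments

chunks = segment_for_sft(list(range(5000)), segment_len=2048, overlap=256)
print([len(c) for c in chunks])  # [2048, 2048, 1416]
```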
arXiv Detail & Related papers (2025-08-01T07:03:16Z)
- mGRADE: Minimal Recurrent Gating Meets Delay Convolutions for Lightweight Sequence Modeling [0.5236468296934584]
mGRADE is a hybrid-memory system that integrates a temporal 1D-convolution with learnable spacings followed by a minimal gated recurrent unit. We demonstrate that mGRADE effectively separates and preserves multi-scale temporal features. This highlights mGRADE's promise as an efficient solution for memory-constrained multi-scale temporal processing at the edge.
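A minimal sketch of the described pipeline, with assumptions: the convolution with learnable spacings is approximated here by a fixed-dilation causal Conv1d, and a standard GRU stands in for the minimal gated recurrent unit. MiniGRADE and its hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniGRADE(nn.Module):
    """Hypothetical sketch in the spirit of mGRADE: a dilated causal temporal
    convolution (standing in for learnable spacings) feeding a gated recurrent unit."""

    def __init__(self, channels: int, kernel_size: int = 4, dilation: int = 4):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation              # causal left-padding
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gru = nn.GRU(channels, channels, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        h = F.pad(x.transpose(1, 2), (self.pad, 0))
        h = self.conv(h).transpose(1, 2)                     # multi-scale local features
        out, _ = self.gru(torch.relu(h))                     # compact recurrent memory
        return out

print(MiniGRADE(channels=32)(torch.randn(8, 100, 32)).shape)  # torch.Size([8, 100, 32])
```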
arXiv Detail & Related papers (2025-07-02T15:44:35Z)
- Echo State Transformer: When chaos brings memory [2.07180164747172]
We introduce Echo State Transformers (EST), a hybrid architecture for sequential data processing. EST integrates Transformer attention mechanisms with Reservoir Computing principles to create a fixed-size, windowed, distributed memory system. EST achieves constant computational complexity at each processing step, effectively breaking the quadratic scaling problem of standard Transformers.
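As a rough sketch of the reservoir-plus-attention idea (not the authors' architecture): a frozen random recurrent reservoir produces states, a fixed-size window of those states acts as the memory, and a trained query attends over it each step. TinyEchoStateMemory and its sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEchoStateMemory(nn.Module):
    """Hypothetical sketch: an untrained reservoir generates states, a bounded window
    of those states serves as keys/values, and a trained query reads from it, so the
    per-step cost does not grow with sequence length."""

    def __init__(self, d_in: int, d_res: int = 128, window: int = 32):
        super().__init__()
        self.window = window
        # Reservoir weights are random and frozen (reservoir computing principle).
        self.w_in = nn.Parameter(torch.randn(d_in, d_res) * 0.1, requires_grad=False)
        self.w_res = nn.Parameter(torch.randn(d_res, d_res) * 0.05, requires_grad=False)
        self.query = nn.Linear(d_in, d_res)                  # trained read-out query

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_in)
        b, t, _ = x.shape
        state = x.new_zeros(b, self.w_res.shape[0])
        outs, bank = [], []
        for step in range(t):
            state = torch.tanh(x[:, step] @ self.w_in + state @ self.w_res)
            bank = (bank + [state])[-self.window:]           # fixed-size memory window
            mem = torch.stack(bank, dim=1)                   # (batch, <=window, d_res)
            q = self.query(x[:, step]).unsqueeze(1)
            outs.append(F.scaled_dot_product_attention(q, mem, mem).squeeze(1))
        return torch.stack(outs, dim=1)

print(TinyEchoStateMemory(16)(torch.randn(2, 50, 16)).shape)  # torch.Size([2, 50, 128])
```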
arXiv Detail & Related papers (2025-06-25T09:56:25Z)
- Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache [67.47789629197857]
We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions. By projecting the long-context-insensitive dimensions onto Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. We show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack.
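A hedged illustration of the spectral-compression idea only: keep a fixed number of low-frequency Fourier coefficients of the KV cache along the time axis and reconstruct an approximation on demand. The helper names and n_coeff are assumptions, and the paper's scheme for selecting which head dimensions to compress is omitted.

```python
import torch

def fourier_compress_kv(kv: torch.Tensor, n_coeff: int = 64):
    """Hypothetical illustration: keep only the first n_coeff frequency components of
    each K/V channel along the time axis, giving a fixed-length cached representation."""
    # kv: (batch, heads, time, head_dim)
    spec = torch.fft.rfft(kv, dim=2)                 # (batch, heads, time//2+1, head_dim)
    return spec[:, :, :n_coeff], kv.shape[2]         # truncated spectrum + original length

def fourier_decompress_kv(spec: torch.Tensor, time_len: int):
    """Reconstruct an approximation of the original K/V tensor from the kept coefficients."""
    full = torch.zeros(spec.shape[0], spec.shape[1], time_len // 2 + 1, spec.shape[3],
                       dtype=spec.dtype, device=spec.device)
    full[:, :, :spec.shape[2]] = spec
    return torch.fft.irfft(full, n=time_len, dim=2)

k = torch.randn(1, 8, 4096, 64)
spec, t = fourier_compress_kv(k, n_coeff=64)
k_approx = fourier_decompress_kv(spec, t)
print(spec.shape, k_approx.shape)  # cached spectrum is far smaller than the full (1, 8, 4096, 64) cache
```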
arXiv Detail & Related papers (2025-06-13T15:35:54Z)
- Compact Recurrent Transformer with Persistent Memory [16.48606806238812]
The Transformer architecture has shown significant success in many language processing and visual tasks. We propose a novel and efficient Compact Recurrent Transformer (CRT). CRT combines shallow Transformer models that process short local segments with recurrent neural networks to compress and manage a single persistent memory vector. We evaluate CRT on Word PTB and WikiText-103 for next-token-prediction tasks, as well as on the Toyota Smarthome video dataset for classification.
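A minimal sketch under stated assumptions: a two-layer Transformer encoder processes each short segment with the persistent memory vector prepended, and a GRU cell updates that single memory vector from the segment summary. CompactRecurrentBlock and its sizes are hypothetical.

```python
import torch
import torch.nn as nn

class CompactRecurrentBlock(nn.Module):
    """Hypothetical sketch of the CRT idea: shallow Transformer over short segments
    plus a recurrent update of one persistent memory vector carried across segments."""

    def __init__(self, d_model: int = 128, seg_len: int = 64):
        super().__init__()
        self.seg_len = seg_len
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.memory_rnn = nn.GRUCell(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        memory = x.new_zeros(x.shape[0], x.shape[-1])        # single persistent memory vector
        outs = []
        for seg in x.split(self.seg_len, dim=1):
            seg_in = torch.cat([memory.unsqueeze(1), seg], dim=1)
            enc = self.encoder(seg_in)
            outs.append(enc[:, 1:])                          # drop the memory slot
            memory = self.memory_rnn(enc[:, 1:].mean(dim=1), memory)
        return torch.cat(outs, dim=1)

print(CompactRecurrentBlock()(torch.randn(2, 256, 128)).shape)  # torch.Size([2, 256, 128])
```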
arXiv Detail & Related papers (2025-05-02T00:11:44Z)
- Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
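The memory saving can be pictured with a simplified sketch (not the paper's implementation): run the wide MLP over sequence chunks so the large intermediate activation is materialized for only one mini-sequence at a time. In training this is typically paired with activation recomputation so the saving carries over to the backward pass; the no-grad usage below only illustrates the forward-pass effect.

```python
import torch
import torch.nn as nn

def mini_sequence_mlp(x: torch.Tensor, mlp: nn.Module, chunk: int = 512) -> torch.Tensor:
    """Hypothetical illustration of mini-sequence processing: apply the (wide) MLP to
    sequence chunks so the d_ff-wide intermediate exists for only `chunk` tokens at a time."""
    return torch.cat([mlp(part) for part in x.split(chunk, dim=1)], dim=1)

d_model, d_ff = 1024, 4096
mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
x = torch.randn(1, 8192, d_model)
with torch.no_grad():                           # forward peak covers 512 tokens, not 8192
    out = mini_sequence_mlp(x, mlp, chunk=512)
print(out.shape)  # torch.Size([1, 8192, 1024])
```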
arXiv Detail & Related papers (2024-07-22T01:52:30Z)
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [0.5899781520375794]
Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks.
However, serving inference for long-content generation poses a challenge due to the enormous memory footprint of the transient state, i.e., the KV cache.
InfiniGen is a novel KV cache management framework tailored for long-text generation.
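A hedged sketch of the selection side of dynamic KV management: score cached keys against the current query and keep only the top-k entries on the fast device. InfiniGen's speculative prefetching across layers and its CPU-offload machinery are not shown; select_important_kv and top_k are illustrative names.

```python
import torch

def select_important_kv(q: torch.Tensor, k_cache: torch.Tensor,
                        v_cache: torch.Tensor, top_k: int = 256):
    """Hypothetical sketch: rank cached keys by their dot product with the current
    query and gather only the top-k most attended KV entries."""
    # q: (batch, heads, head_dim); caches: (batch, heads, time, head_dim)
    scores = torch.einsum('bhd,bhtd->bht', q, k_cache)       # approximate importance
    idx = scores.topk(min(top_k, scores.shape[-1]), dim=-1).indices
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, -1, k_cache.shape[-1])
    return k_cache.gather(2, idx_exp), v_cache.gather(2, idx_exp)

q = torch.randn(1, 8, 64)
k_cache, v_cache = torch.randn(1, 8, 10000, 64), torch.randn(1, 8, 10000, 64)
k_sel, v_sel = select_important_kv(q, k_cache, v_cache, top_k=256)
print(k_sel.shape)  # torch.Size([1, 8, 256, 64])
```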
arXiv Detail & Related papers (2024-06-28T07:41:26Z)
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only about 2% additional parameters.
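TBPTT itself is standard and can be sketched as below: the recurrent state is carried across chunks but detached at chunk boundaries, so gradients only flow within each chunk. The toy GRU, head, and hyperparameters are placeholders, not UIO-LLMs' memory-enhanced transformer.

```python
import torch
import torch.nn as nn

def tbptt_train(model: nn.GRU, head: nn.Linear, data: torch.Tensor,
                targets: torch.Tensor, trunc: int = 128, lr: float = 1e-3):
    """Hypothetical illustration of Truncated Backpropagation Through Time."""
    opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=lr)
    hidden = None
    for x, y in zip(data.split(trunc, dim=1), targets.split(trunc, dim=1)):
        out, hidden = model(x, hidden)
        loss = nn.functional.mse_loss(head(out), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        hidden = hidden.detach()          # truncate the backward graph at the chunk boundary

model, head = nn.GRU(32, 64, batch_first=True), nn.Linear(64, 32)
tbptt_train(model, head, torch.randn(4, 1024, 32), torch.randn(4, 1024, 32))
```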
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
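A minimal sketch of the layer-sharing idea only (the mixture-of-experts machinery is omitted): one Transformer layer is reused at every depth step, so depth grows without adding parameters. SharedLayerTransformer and its sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    """Hypothetical sketch of Universal-Transformer-style layer sharing: the same
    layer is applied repeatedly instead of stacking distinct layers."""

    def __init__(self, d_model: int = 256, steps: int = 6):
        super().__init__()
        self.steps = steps
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.steps):          # same parameters reused at every depth step
            x = self.shared_layer(x)
        return x

m = SharedLayerTransformer()
print(sum(p.numel() for p in m.parameters()))   # parameter count of a single layer
print(m(torch.randn(2, 32, 256)).shape)         # torch.Size([2, 32, 256])
```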
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
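A simplified, hypothetical take on blockwise computation: queries are processed in blocks and the feedforward network is applied to each block's attention output immediately, so the full attention matrix and the full FFN activation never exist at once. BPT also blocks keys/values and handles causal masking; both are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def blockwise_attention_ffn(q, k, v, ffn: nn.Module, block: int = 256):
    """Hypothetical sketch: chunk queries into blocks and fuse the feedforward
    network into the per-block loop to bound peak activation memory."""
    outs = []
    for q_blk in q.split(block, dim=1):                  # (batch, block, d)
        attn = F.scaled_dot_product_attention(q_blk, k, v)
        outs.append(ffn(attn))                           # fused feedforward per block
    return torch.cat(outs, dim=1)

d = 128
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
q = k = v = torch.randn(1, 4096, d)
print(blockwise_attention_ffn(q, k, v, ffn).shape)  # torch.Size([1, 4096, 128])
```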
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- IMBUE: In-Memory Boolean-to-CUrrent Inference ArchitecturE for Tsetlin Machines [5.6634493664726495]
In-memory computing for Machine Learning (ML) applications remedies the von Neumann bottleneck by organizing computation to exploit parallelism and locality.
Non-volatile memory devices such as Resistive RAM (ReRAM) offer integrated switching and storage capabilities showing promising performance for ML applications.
This paper proposes an In-Memory Boolean-to-Current Inference Architecture (IMBUE) that uses ReRAM-transistor cells to eliminate the need for such conversions.
arXiv Detail & Related papers (2023-05-22T10:55:01Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
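As a generic illustration of the adapter idea (not the exact adapter-ALBERT design): a small bottleneck module is trained per task while the large backbone stays frozen, which is what enables weight reuse across tasks. BottleneckAdapter and its sizes are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Hypothetical sketch of a residual bottleneck adapter: only these small
    per-task weights are trained; the backbone remains frozen and shared."""

    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(torch.relu(self.down(hidden)))  # residual adapter

backbone_out = torch.randn(2, 16, 768)        # stand-in for a frozen backbone layer output
print(BottleneckAdapter()(backbone_out).shape)  # torch.Size([2, 16, 768])
```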
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation [104.44478403427881]
EdgeFormer is a parameter-efficient Transformer of the encoder-decoder architecture for on-device seq2seq generation.
We conduct experiments on two practical on-device seq2seq tasks: Machine Translation and Grammatical Error Correction.
arXiv Detail & Related papers (2022-02-16T10:10:00Z)
- Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory [23.022723184325017]
Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition.
We propose a novel augmented-memory self-attention, which attends on a short segment of the input sequence and a bank of memories.
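A hedged sketch of the mechanism: each short segment attends over itself plus a bank of memory vectors, and the bank grows by one summary vector per processed segment. AugmentedMemoryAttention, the mean-pooled summary, and the single-head projection are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentedMemoryAttention(nn.Module):
    """Hypothetical sketch of augmented-memory self-attention for streaming input:
    attend over the current segment plus a compact bank of past-segment summaries."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)

    def forward(self, segments: list) -> list:
        memory_bank, outputs = [], []
        for seg in segments:                              # seg: (batch, seg_len, d)
            context = torch.cat(memory_bank + [seg], dim=1) if memory_bank else seg
            q, _, _ = self.qkv(seg).chunk(3, dim=-1)
            _, k, v = self.qkv(context).chunk(3, dim=-1)
            outputs.append(F.scaled_dot_product_attention(q, k, v))
            memory_bank.append(seg.mean(dim=1, keepdim=True))   # one summary vector per segment
        return outputs

segs = [torch.randn(2, 40, 256) for _ in range(5)]
outs = AugmentedMemoryAttention()(segs)
print(outs[-1].shape)  # torch.Size([2, 40, 256])
```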
arXiv Detail & Related papers (2020-05-16T16:54:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.