EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices
- URL: http://arxiv.org/abs/2503.22196v1
- Date: Fri, 28 Mar 2025 07:26:37 GMT
- Title: EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices
- Authors: Jiyu Chen, Shuang Peng, Daxiong Luo, Fan Yang, Renshou Wu, Fangyuan Li, Xiaoxin Chen
- Abstract summary: Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs.
- Score: 3.739419555718102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and the growing memory demands of the Key-Value (KV) cache. Existing KV cache optimizations struggle with irreversible token eviction in long-output tasks, while alternative sequence modeling architectures prove costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach maintains full compatibility with standard Transformer architectures, requires fine-tuning only a small fraction of the parameters, and enables selective activation of the memory-gating module to route long- and short-context tasks. Experimental results show that EdgeInfinite achieves performance comparable to a baseline Transformer-based LLM on long-context benchmarks while reducing memory consumption and time to first token.
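The sketch below is a minimal, hypothetical illustration (not the authors' released code) of how a trainable memory-gating module might blend a fixed-size compressed memory with standard local attention in a single head. It assumes an Infini-attention-style linear-attention memory; the class name `CompressedMemoryAttention`, the ELU feature map, the per-channel gate `mem_gate`, and the `use_memory` routing flag are all illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch only: a compressed-memory attention head with a
# trainable gate. Names and the exact memory update rule are assumptions,
# not the EdgeInfinite implementation.
import torch
import torch.nn.functional as F
from torch import nn


class CompressedMemoryAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        # Trainable memory gate: the only newly introduced parameters here,
        # so fine-tuning touches a small fraction of the model.
        self.mem_gate = nn.Parameter(torch.zeros(dim))
        # Fixed-size compressed memory: cost independent of context length.
        self.register_buffer("mem", torch.zeros(dim, dim))
        self.register_buffer("mem_norm", torch.zeros(dim))

    def forward(self, x: torch.Tensor, use_memory: bool = True) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)  # (T, dim)
        # Standard causal softmax attention over the current segment.
        local = F.scaled_dot_product_attention(
            q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0), is_causal=True
        ).squeeze(0)
        if not use_memory:  # short-context route: plain Transformer path
            return local
        # Linear-attention-style read from the compressed memory.
        sq = F.elu(q) + 1.0
        mem_out = (sq @ self.mem) / (sq @ self.mem_norm + 1e-6).unsqueeze(-1)
        # Fold the current segment's keys/values into the memory state.
        with torch.no_grad():
            sk = F.elu(k) + 1.0
            self.mem += sk.transpose(0, 1) @ v
            self.mem_norm += sk.sum(dim=0)
        # Per-channel gate blends the memory read with local attention.
        g = torch.sigmoid(self.mem_gate)
        return g * mem_out + (1.0 - g) * local
```

In this sketch, only `mem_gate` would be fine-tuned on top of a frozen pretrained block, and short-context requests can pass `use_memory=False` to fall back to the unmodified softmax path, loosely mirroring the selective routing described in the abstract.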
Related papers
- Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
arXiv Detail & Related papers (2024-07-22T01:52:30Z) - InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [0.5899781520375794]
Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks.
However, serving inference for long-content generation poses a challenge due to the enormous memory footprint of the transient state (the KV cache).
InfiniGen is a novel KV cache management framework tailored for long-text generation.
arXiv Detail & Related papers (2024-06-28T07:41:26Z) - UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z) - MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z) - Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) computes self-attention blockwise and fuses it with the feedforward network to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z) - IMBUE: In-Memory Boolean-to-CUrrent Inference ArchitecturE for Tsetlin Machines [5.6634493664726495]
In-memory computing for Machine Learning (ML) applications remedies the von Neumann bottleneck by organizing computation to exploit parallelism and locality.
Non-volatile memory devices such as Resistive RAM (ReRAM) offer integrated switching and storage capabilities showing promising performance for ML applications.
This paper proposes an In-Memory Boolean-to-Current Inference Architecture (IMBUE) that uses ReRAM-transistor cells to eliminate the need for such conversions.
arXiv Detail & Related papers (2023-05-22T10:55:01Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation [104.44478403427881]
EdgeFormer is a parameter-efficient Transformer with an encoder-decoder architecture for on-device seq2seq generation.
We conduct experiments on two practical on-device seq2seq tasks: Machine Translation and Grammatical Error Correction.
arXiv Detail & Related papers (2022-02-16T10:10:00Z) - Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory [23.022723184325017]
Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition.
We propose a novel augmented-memory self-attention, which attends to a short segment of the input sequence and a bank of memories; a minimal illustrative sketch follows after this list.
arXiv Detail & Related papers (2020-05-16T16:54:52Z)
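As a rough illustration of the augmented-memory idea in the last entry above, the sketch below is a hypothetical construction, not the paper's implementation: the function name, the omitted projections, and the mean-pooled memory update are assumptions. It lets each frame of a short segment attend jointly to a bank of memory vectors and to the segment itself.

```python
# Illustrative sketch only: self-attention over a short segment augmented with
# a bank of memory vectors; names and the mean-pooled update are assumptions.
import torch
import torch.nn.functional as F


def augmented_memory_attention(segment: torch.Tensor, memory_bank: torch.Tensor):
    """segment: (T, d) frames of the current chunk; memory_bank: (M, d) slots."""
    # Keys/values span the memory bank and the segment, so each frame can use
    # context beyond the segment boundary at a fixed extra cost.
    kv = torch.cat([memory_bank, segment], dim=0)               # (M + T, d)
    out = F.scaled_dot_product_attention(
        segment.unsqueeze(0), kv.unsqueeze(0), kv.unsqueeze(0)
    ).squeeze(0)                                                # (T, d)
    # Summarize the segment into one new memory slot for later chunks.
    new_bank = torch.cat([memory_bank, segment.mean(dim=0, keepdim=True)], dim=0)
    return out, new_bank
```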