Modality Agnostic Efficient Long Range Encoder
- URL: http://arxiv.org/abs/2507.19409v1
- Date: Fri, 25 Jul 2025 16:19:47 GMT
- Title: Modality Agnostic Efficient Long Range Encoder
- Authors: Toufiq Parag, Ahmed Elgammal,
- Abstract summary: We address the challenge of long-context processing on a single device using generic implementations.<n>To overcome these limitations, we propose MAELRE, a unified and efficient transformer architecture.<n>We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models.
- Score: 14.705955027331674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The long-context capability of recent large transformer models can be surmised to rely on techniques such as attention/model parallelism, as well as hardware-level optimizations. While these strategies allow input lengths to scale to millions of tokens, they do not fundamentally mitigate the quadratic computational and memory complexity of the core attention mechanism. In this paper, we address the challenge of long-context processing on a single device using generic implementations by reducing the quadratic memory footprint and inference cost. Existing approaches to extend the context length for generic single device implementations -- such as token merging and modified attentions -- are often modality specific and attain a suboptimal tradeoff between accuracy and efficiency. To overcome these limitations, we propose MAELRE (Modality Agnostic Efficient Long Range Encoder), a unified and efficient transformer architecture designed for long-range encoding across diverse modalities. MAELRE integrates token merging with attention approximation, progressively merging tokens at different stages of internal computational blocks. It employs a lightweight attention approximation when the number of tokens is large, and switches to standard dot-product attention as the sequence becomes shorter through successive aggregation. We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models on classification tasks spanning multiple modalities, including text, time series, audio, and vision.
Related papers
- Don't Pay Attention [0.552480439325792]
Avey is a new neural foundational architecture that breaks away from both attention and recurrence.<n>Avey decouples sequence length from context width, thus enabling effective processing of arbitrarily long sequences.<n>Results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks.
arXiv Detail & Related papers (2025-06-12T21:11:06Z) - Fast correlated decoding of transversal logical algorithms [67.01652927671279]
Quantum error correction (QEC) is required for large-scale computation, but incurs a significant resource overhead.<n>Recent advances have shown that by jointly decoding logical qubits in algorithms composed of logical gates, the number of syndrome extraction rounds can be reduced.<n>Here, we reform the problem of decoding circuits by directly decoding relevant logical operator products as they propagate through the circuit.
arXiv Detail & Related papers (2025-05-19T18:00:00Z) - SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs [0.0]
We introduce SparseAccelerate, a dynamic sparse attention method that adapts its sparsity patterns based on input characteristics.<n> Experimental results show that SparseAccelerate achieves up to a 1.04x reduction in Time-To-First-Token (TTTF) latency at 32K tokens.
arXiv Detail & Related papers (2024-12-09T04:27:03Z) - UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.<n>We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.<n>UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.<n>These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.<n>We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to complement the lack of long-range dependency issues.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - SPEED: Speculative Pipelined Execution for Efficient Decoding [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
arXiv Detail & Related papers (2023-10-18T16:07:01Z) - Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.