Block Transformer: Global-to-Local Language Modeling for Fast Inference
- URL: http://arxiv.org/abs/2406.02657v2
- Date: Fri, 01 Nov 2024 08:52:18 GMT
- Title: Block Transformer: Global-to-Local Language Modeling for Fast Inference
- Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun,
- Abstract summary: Self-attention requires the key-value cache of all previous sequences to be retrieved from memory at every decoding step to retrieve context information.
We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention.
We demonstrate that Block Transformers reach 10--20x inference throughput compared to vanilla transformers with equivalent perplexity and zero-shot task performance.
- Score: 41.361135177431336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of all previous sequences to be retrieved from memory at every decoding step to retrieve context information, leading to two primary bottlenecks during batch inference. First, there is a significant delay in obtaining the first token, as the information of the entire prompt must first be processed to prefill the KV cache. Second, computation of subsequent tokens is bottlenecked by the high memory I/O demand of fetching the entire KV cache, which grows linearly with sequence length, incurring quadratic memory reads overall. We design the Block Transformer to strategically mitigate these costs, by incorporating coarsity and locality into an integrated global-to-local architecture. At the lower layers, we aggregate tokens into fixed size blocks to apply attention across the entire sequence at coarse-grained detail, to capture the global context while minimizing KV cache overhead. At upper layers, we apply attention within each block to decode individual tokens, to model fine-grained details with a lightweight local KV cache. We pretrain vanilla and Block Transformers from scratch and demonstrate that Block Transformers reach 10--20x inference throughput compared to vanilla transformers with equivalent perplexity and zero-shot task performance. Code is available at https://github.com/itsnamgyu/block-transformer.
Related papers
- SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers [15.142822497807236]
We propose SCOUT, a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations.<n>SCOUT retains much of the expressivity of full attention while substantially reducing the computational and memory cost.<n>We analyze SCOUT's computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks.
arXiv Detail & Related papers (2025-08-31T17:08:33Z) - Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning [16.35681450323654]
Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute.<n>We give a theoretical justification as to why memory (re)consolidation via KV cache rewrites is beneficial for improved reasoning.<n>Our model sees consistent performance gains over vanilla Transformers and pause-token augmented baselines, with gains of up to +6.6pp for selected tasks/backbones.
arXiv Detail & Related papers (2025-05-22T17:33:49Z) - PiT: Progressive Diffusion Transformer [50.46345527963736]
We propose a series of Pseudo textbfProgressive Dtextbfiffusion textbfTransformer (textbfPiT)<n>Our proposed PiT-L achieves 54%$uparrow$ FID improvement over DiT-XL/2 while using less computation.
arXiv Detail & Related papers (2025-05-19T15:02:33Z) - Tensor Product Attention Is All You Need [53.69820973900921]
Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly.<n>TPA achieves improved model quality alongside memory efficiency.<n>Based on TPA, we introduce the Product Attention Transformer,(T6), a new model architecture for sequence modeling.
arXiv Detail & Related papers (2025-01-11T03:37:10Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference.<n>We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence.<n>Our method achieves the state-of-the-art performance compared with others.
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - Attamba: Attending To Multi-Token States [6.5676809841642125]
We introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens.
We find that replacing key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking.
Attamba can perform attention on chunked-sequences of variable length, enabling a smooth transition between quadratic and linear scaling.
arXiv Detail & Related papers (2024-11-26T18:52:06Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z) - Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization.
We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z) - Context Compression for Auto-regressive Transformers with Sentinel
Tokens [37.07722536907739]
We propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones.
Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach.
arXiv Detail & Related papers (2023-10-12T09:18:19Z) - Recurrent Memory Transformer [0.3529736140137003]
We study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer)
We implement a memory mechanism with no changes to Transformer model by adding special memory tokens to the input or output sequence.
Our model performs on par with the Transformer-XL on language modeling for smaller memory sizes and outperforms it for tasks that require longer sequence processing.
arXiv Detail & Related papers (2022-07-14T13:00:22Z) - Block-Skim: Efficient Question Answering for Transformer [25.429122678247452]
We propose Block-Skim, which learns to skim unnecessary context in higher hidden layers to improve and accelerate the Transformer performance.
We further prune the hidden states corresponding to the unnecessary positions early in lower layers, achieving significant inference-time speedup.
Block-Skim improves QA models' accuracy on different datasets and achieves 3 times speedup on BERT-base model.
arXiv Detail & Related papers (2021-12-16T01:45:33Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z) - Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z) - Cluster-Former: Clustering-based Sparse Transformer for Long-Range
Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.