Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
- URL: http://arxiv.org/abs/2312.08618v1
- Date: Thu, 14 Dec 2023 02:45:31 GMT
- Title: Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
- Authors: Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, Dong Yu
- Abstract summary: This paper introduces a novel approach to enhance the capabilities of Large Language Models (LLMs) in processing and understanding extensive text sequences.
We propose a new model architecture, referred to as Zebra, which efficiently manages the quadratic time and memory complexity issues associated with full attention.
Our model, akin to a zebra's alternating stripes, balances local and global attention layers, significantly reducing computational requirements and memory consumption.
- Score: 44.67973028541842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel approach to enhance the capabilities of Large
Language Models (LLMs) in processing and understanding extensive text
sequences, a critical aspect in applications requiring deep comprehension and
synthesis of large volumes of information. Recognizing the inherent challenges
in extending the context window for LLMs, primarily built on Transformer
architecture, we propose a new model architecture, referred to as Zebra. This
architecture efficiently manages the quadratic time and memory complexity
issues associated with full attention in the Transformer by employing grouped
local-global attention layers. Our model, akin to a zebra's alternating
stripes, balances local and global attention layers, significantly reducing
computational requirements and memory consumption. Comprehensive experiments,
including pretraining from scratch, continuation of long context adaptation
training, and long instruction tuning, are conducted to evaluate Zebra's
performance. The results show that Zebra achieves comparable or superior
performance on both short and long sequence benchmarks, while also enhancing
training and inference efficiency.
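The abstract describes the grouping only at a high level; below is a minimal sketch of one plausible reading, assuming that within each group of decoder layers the first layer attends globally while the remaining layers use a causal sliding window. The group size, window size, and the first-layer-global rule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a layerwise grouped local-global attention schedule.
# Assumption (not from the paper): one global layer per group, the rest local.
import numpy as np

def layer_schedule(num_layers: int, group_size: int = 4) -> list[str]:
    """Assign an attention type to each layer: one global layer per group,
    followed by (group_size - 1) local layers."""
    return ["global" if i % group_size == 0 else "local" for i in range(num_layers)]

def attention_mask(seq_len: int, kind: str, window: int = 512) -> np.ndarray:
    """Boolean causal mask: entry (i, j) is True if query i may attend to key j."""
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if kind == "global":
        return causal                                    # full causal attention
    offsets = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    return causal & (offsets < window)                   # sliding-window attention

if __name__ == "__main__":
    print(layer_schedule(num_layers=8, group_size=4))
    # ['global', 'local', 'local', 'local', 'global', 'local', 'local', 'local']
    print(attention_mask(6, "local", window=2).astype(int))
```

Under such a schedule only one layer per group materializes the full quadratic attention pattern; each local layer touches roughly seq_len x window entries, which is where the reduced compute and memory relative to full attention would come from.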
Related papers
- On the Power of Convolution Augmented Transformer [30.46405043231576]
We study the benefits of Convolution-Augmented Transformer (CAT) for recall, copying, and length generalization tasks.
CAT incorporates convolutional filters in the K/Q/V embeddings of an attention layer.
We show that the locality of the convolution synergizes with the global view of the attention.
arXiv Detail & Related papers (2024-07-08T04:08:35Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods. (A rough sketch of the per-query top-k selection idea appears after this list.)
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Advancing Transformer Architecture in Long-Context Large Language
Models: A Comprehensive Survey [18.930417261395906]
Transformer-based Large Language Models (LLMs) have been applied in diverse areas such as knowledge bases, human interfaces, and dynamic agents.
This article offers a survey of recent advancements in Transformer-based LLM architectures aimed at enhancing the long-context capabilities of LLMs.
arXiv Detail & Related papers (2023-11-21T04:59:17Z) - Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative
Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
arXiv Detail & Related papers (2022-11-21T18:22:39Z) - Exploiting Explainable Metrics for Augmented SGD [43.00691899858408]
There are several unanswered questions about how learning under optimization really works and why certain strategies are better than others.
We propose new explainability metrics that measure the redundant information in a network's layers.
We then exploit these metrics to augment Stochastic Gradient Descent (SGD) by adaptively adjusting the learning rate in each layer to improve generalization performance.
arXiv Detail & Related papers (2022-03-31T00:16:44Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z) - GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z) - Deep Reinforced Self-Attention Masks for Abstractive Summarization
(DR.SAS) [0.0]
We present a novel architectural scheme to tackle the abstractive summarization problem based on the CNN/DM dataset.
We have tested the limits of learning fine-grained attention in Transformers to improve the summarization quality.
Our model tends to be more extractive/factual yet coherent in detail because of optimization over ROUGE rewards.
arXiv Detail & Related papers (2019-12-30T01:32:42Z)