Cached Transformers: Improving Transformers with Differentiable Memory Cache
- URL: http://arxiv.org/abs/2312.12742v1
- Date: Wed, 20 Dec 2023 03:30:51 GMT
- Title: Cached Transformers: Improving Transformers with Differentiable Memory Cache
- Authors: Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo
- Abstract summary: This work introduces a new Transformer model called Cached Transformer.
It uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens.
- Score: 71.28188777209034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work introduces a new Transformer model called Cached Transformer, which
uses Gated Recurrent Cached (GRC) attention to extend the self-attention
mechanism with a differentiable memory cache of tokens. GRC attention enables
attending to both past and current tokens, increasing the receptive field of
attention and allowing for exploring long-range dependencies. By utilizing a
recurrent gating unit to continuously update the cache, our model achieves
significant advancements in six language and vision tasks, including
language modeling, machine translation, ListOPs, image classification, object
detection, and instance segmentation. Furthermore, our approach surpasses
previous memory-based techniques in tasks such as language modeling and
displays the ability to be applied to a broader range of situations.
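To make the GRC mechanism concrete, the following is a minimal PyTorch sketch of attention over a gated recurrent token cache: a sigmoid gate mixes the previous cache contents with a summary of the incoming tokens, and queries attend to the concatenation of cached and current tokens. The class name GRCAttention, the interpolation-based token summary, and all shapes and hyperparameters are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of gated-recurrent-cache (GRC) attention, single-head setting.
# All names and design choices here are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRCAttention(nn.Module):
    def __init__(self, dim, cache_len):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Gating unit deciding how much of the old cache to keep when new
        # token embeddings arrive (hypothetical parameterization).
        self.gate = nn.Linear(2 * dim, dim)
        self.cache_len = cache_len
        self.scale = dim ** -0.5

    def update_cache(self, cache, tokens):
        # Summarize the current tokens down to the cache length by linear
        # interpolation (one simple choice; the paper's update may differ).
        summary = F.interpolate(
            tokens.transpose(1, 2), size=self.cache_len,
            mode="linear", align_corners=False).transpose(1, 2)
        g = torch.sigmoid(self.gate(torch.cat([cache, summary], dim=-1)))
        return g * cache + (1.0 - g) * summary  # gated recurrent update

    def forward(self, x, cache=None):
        b, n, d = x.shape
        if cache is None:
            cache = x.new_zeros(b, self.cache_len, d)
        cache = self.update_cache(cache, x)
        # Attend over both the cached (past) and current tokens, which is
        # what enlarges the receptive field of attention.
        kv = torch.cat([cache, x], dim=1)
        q, k, v = self.q_proj(x), self.k_proj(kv), self.v_proj(kv)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # The cache is carried to the next segment; keeping it in the graph
        # is what makes the memory cache differentiable.
        return attn @ v, cache


# Usage: process a stream of segments while carrying the cache forward.
layer = GRCAttention(dim=64, cache_len=16)
cache = None
for segment in torch.randn(3, 2, 32, 64):  # 3 segments, batch 2, 32 tokens
    out, cache = layer(segment, cache)
print(out.shape)  # torch.Size([2, 32, 64])
```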
Related papers
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [0.5899781520375794]
Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks.
However, serving inference for long-content generation poses a challenge due to the enormous memory footprint of the transient state.
InfiniGen is a novel KV cache management framework tailored for long-text generation.
arXiv Detail & Related papers (2024-06-28T07:41:26Z)
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models [44.24593677113768]
We propose a novel method that only computes and caches the KVs of a small number of layers.
Our method achieves up to 26x higher throughput than standard transformers.
arXiv Detail & Related papers (2024-05-17T08:59:46Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without model fine-tuning; a generic eviction sketch follows this list.
Our validation shows that CORM reduces the inference memory usage of the KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- Spatially-Aware Transformer for Embodied Agents [20.498778205143477]
This paper explores the use of Spatially-Aware Transformer models that incorporate spatial information.
We demonstrate that memory utilization efficiency can be improved, leading to enhanced accuracy in various place-centric downstream tasks.
We also propose the Adaptive Memory Allocator, a memory management method based on reinforcement learning.
arXiv Detail & Related papers (2024-02-23T07:46:30Z)
- Recurrent Action Transformer with Memory [39.58317527488534]
This paper proposes a novel model architecture that incorporates a recurrent memory mechanism designed to regulate information retention.
We conduct experiments on memory-intensive environments (ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid-Memory), classic Atari games, and MuJoCo control environments.
The results show that using memory can significantly improve performance in memory-intensive environments, while maintaining or improving results in classic environments.
arXiv Detail & Related papers (2023-06-15T19:29:08Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) computes self-attention and the feedforward network blockwise and fuses them to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- LaMemo: Language Modeling with Look-Ahead Memory [50.6248714811912]
We propose Look-Ahead Memory (LaMemo) that enhances the recurrence memory by incrementally attending to the right-side tokens.
LaMemo embraces bi-directional attention and segment recurrence with an additional overhead only linearly proportional to the memory length.
Experiments on widely used language modeling benchmarks demonstrate its superiority over the baselines equipped with different types of memory.
arXiv Detail & Related papers (2022-04-15T06:11:25Z)
- HM4: Hidden Markov Model with Memory Management for Visual Place Recognition [54.051025148533554]
We develop a Hidden Markov Model approach for visual place recognition in autonomous driving.
Our algorithm, dubbed HM4, exploits temporal look-ahead to transfer promising candidate images between passive storage and active memory.
We show that this allows constant time and space inference for a fixed coverage area.
arXiv Detail & Related papers (2020-11-01T08:49:24Z)
- Memory Transformer [0.31406146587437894]
Transformer-based models have achieved state-of-the-art results in many natural language processing tasks.
Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations.
We evaluate these memory-augmented Transformers and demonstrate that the presence of memory positively correlates with model performance.
arXiv Detail & Related papers (2020-06-20T09:06:27Z)
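As noted in the CORM entry above, several of these related works manage the KV cache by evicting entries at inference time. The sketch below illustrates the general idea with a generic heuristic, keeping the most recent tokens plus the older tokens that received the most cumulative attention. This is not CORM's actual policy; the function evict_kv, its scoring rule, and all parameters are hypothetical.

```python
# Generic sketch of score-based KV-cache eviction (not CORM's exact rule):
# keep the most recent tokens, then fill the remaining budget with the older
# tokens that received the highest cumulative attention. Assumes budget > recent.
import torch

def evict_kv(keys, values, attn_scores, budget, recent=32):
    """keys/values: (seq, dim); attn_scores: (seq,) cumulative attention
    received by each cached token; budget: max tokens to keep."""
    seq = keys.shape[0]
    if seq <= budget:
        return keys, values, attn_scores
    # Always keep the most recent `recent` tokens.
    recent_idx = torch.arange(seq - recent, seq)
    # Fill the rest of the budget with the highest-scoring older tokens.
    older_scores = attn_scores[: seq - recent]
    top_idx = torch.topk(older_scores, budget - recent).indices
    keep = torch.cat([top_idx.sort().values, recent_idx])
    return keys[keep], values[keep], attn_scores[keep]


# Usage with random data: shrink a 128-token cache to a 64-token budget.
keys, values = torch.randn(128, 64), torch.randn(128, 64)
scores = torch.rand(128)
k2, v2, s2 = evict_kv(keys, values, scores, budget=64)
print(k2.shape)  # torch.Size([64, 64])
```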
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.