TransforMAP: Transformer for Memory Access Prediction
- URL: http://arxiv.org/abs/2205.14778v1
- Date: Sun, 29 May 2022 22:14:38 GMT
- Title: TransforMAP: Transformer for Memory Access Prediction
- Authors: Pengmiao Zhang, Ajitesh Srivastava, Anant V. Nori, Rajgopal Kannan,
Viktor K. Prasanna
- Abstract summary: Data Prefetching is a technique that can hide memory latency by fetching data before it is needed by a program.
We develop TransforMAP, based on the powerful Transformer model, that can learn from the whole address space.
We show that our approach achieves 35.67% MPKI improvement, higher than the state-of-the-art Best-Offset prefetcher and the ISB prefetcher.
- Score: 10.128730975303407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data Prefetching is a technique that can hide memory latency by fetching data
before it is needed by a program. Prefetching relies on accurate memory access
prediction, a task to which machine learning-based methods are increasingly
applied. Unlike previous approaches that learn from deltas or offsets and
perform one access prediction, we develop TransforMAP, based on the powerful
Transformer model, that can learn from the whole address space and perform
multiple cache line predictions. We propose to use the binary of memory
addresses as model input, which avoids information loss and saves a token table
in hardware. We design a block index bitmap to collect unordered future page
offsets under the current page address as learning labels. As a result, our
model can learn temporal patterns as well as spatial patterns within a page. In
a practical implementation, this approach has the potential to hide prediction
latency because it prefetches multiple cache lines likely to be used in a long
horizon. We show that our approach achieves 35.67% MPKI improvement and 20.55%
IPC improvement in simulation, higher than state-of-the-art Best-Offset
prefetcher and ISB prefetcher.
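As an illustration of the two input/label ideas in the abstract above (binary address tokens as model input, and a block-index bitmap of unordered future page offsets as the training label), the following sketch uses assumed page, block, and lookahead parameters; it is not the authors' implementation:

```python
# Illustrative sketch, not the TransforMAP code. Page/offset widths and
# the lookahead window below are assumed values, not taken from the paper.

PAGE_BITS = 12      # assumed 4 KiB pages
BLOCK_BITS = 6      # assumed 64-byte cache blocks, 64 blocks per page
LOOKAHEAD = 8       # assumed label-collection window

def address_to_bits(addr: int, width: int = 32) -> list[int]:
    """Tokenize a memory address as its raw binary digits, so no
    learned vocabulary (token table) is needed in hardware."""
    return [(addr >> i) & 1 for i in range(width - 1, -1, -1)]

def bitmap_label(trace: list[int], pos: int) -> int:
    """Collect the unordered future block offsets that fall on the same
    page as trace[pos], within the lookahead window, as one bitmap."""
    page = trace[pos] >> PAGE_BITS
    bitmap = 0
    for future in trace[pos + 1 : pos + 1 + LOOKAHEAD]:
        if future >> PAGE_BITS == page:
            offset = (future >> BLOCK_BITS) & ((1 << (PAGE_BITS - BLOCK_BITS)) - 1)
            bitmap |= 1 << offset
    return bitmap
```

Because the label is a set (bitmap) rather than a single next address, one prediction can cover multiple cache lines likely to be used over a long horizon.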
Related papers
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost.
We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration.
Our method achieves state-of-the-art performance compared with existing approaches.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Memory-and-Anticipation Transformer for Online Action Understanding [52.24561192781971]
We propose a novel memory-anticipation-based paradigm to model an entire temporal structure, including the past, present, and future.
We present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.
arXiv Detail & Related papers (2023-08-15T17:34:54Z)
- MUSTACHE: Multi-Step-Ahead Predictions for Cache Eviction [0.709016563801433]
MUSTACHE is a new page cache replacement policy whose eviction logic is learned from observed memory access requests rather than fixed in advance like existing policies.
We formulate the page request prediction problem as a categorical time series forecasting task.
Our method queries the learned page request forecaster to obtain the next $k$ predicted page memory references to better approximate the optimal Bélády's replacement algorithm.
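The eviction idea described above can be sketched as follows; this is an illustrative approximation (not the MUSTACHE code), with the learned forecaster abstracted into a given list of $k$ predicted page references:

```python
# Hedged sketch: evict the cached page whose next predicted use is
# farthest in the future (or never predicted), mirroring how Belady's
# optimal policy uses true future references.

def choose_victim(cache: set[int], predicted: list[int]) -> int:
    """Return the cached page to evict under the predicted reference
    sequence. Pages absent from the prediction are evicted first."""
    # Iterate in reverse so the EARLIEST index per page survives in the dict.
    next_use = {page: i for i, page in reversed(list(enumerate(predicted)))}
    return max(cache, key=lambda p: next_use.get(p, len(predicted)))
```

A longer prediction horizon $k$ makes this ranking closer to the true Bélády decision, at the cost of relying more heavily on forecaster accuracy.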
arXiv Detail & Related papers (2022-11-03T23:10:21Z)
- A Memory Transformer Network for Incremental Learning [64.0410375349852]
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.
Despite the straightforward problem formulation, the naive application of classification models to class-incremental learning results in the "catastrophic forgetting" of previously seen classes.
One of the most successful existing methods has been the use of a memory of exemplars, which overcomes the issue of catastrophic forgetting by saving a subset of past data into a memory bank and utilizing it to prevent forgetting when training future tasks.
arXiv Detail & Related papers (2022-10-10T08:27:28Z)
- Fine-Grained Address Segmentation for Attention-Based Variable-Degree Prefetching [10.128730975303407]
We propose TransFetch, a novel way to model prefetching.
To reduce vocabulary size, we use fine-grained address segmentation as input.
To predict unordered sets of future addresses, we use delta bitmaps for multiple outputs.
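Fine-grained address segmentation as described above can be sketched like this: a wide address is split into fixed-width segments, so each segment draws from a small vocabulary of 2^SEG_BITS tokens instead of one token per full address. The segment and address widths below are assumed values, not taken from the paper:

```python
# Illustrative sketch of address segmentation, not the TransFetch code.

SEG_BITS = 4          # assumed segment width (vocabulary of 16 per segment)
ADDR_BITS = 32        # assumed address width

def segment_address(addr: int) -> list[int]:
    """Split an address into ADDR_BITS // SEG_BITS small tokens,
    most-significant segment first."""
    mask = (1 << SEG_BITS) - 1
    n_segments = ADDR_BITS // SEG_BITS
    return [(addr >> (SEG_BITS * i)) & mask
            for i in range(n_segments - 1, -1, -1)]
```

With these assumed widths, a 32-bit address becomes 8 tokens over a 16-entry vocabulary, rather than one token from an address-sized vocabulary.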
arXiv Detail & Related papers (2022-05-01T05:30:37Z)
- Remember Intentions: Retrospective-Memory-based Trajectory Prediction [31.25007169374468]
We propose MemoNet, an instance-based approach that predicts the movement intentions of agents by looking for similar scenarios in the training data.
Experiments show that the proposed MemoNet improves the FDE by 20.3%/10.2%/28.3% from the previous best method on SDD/ETH-UCY/NBA datasets.
arXiv Detail & Related papers (2022-03-22T05:59:33Z)
- MANTRA: Memory Augmented Networks for Multiple Trajectory Prediction [26.151761714896118]
We address the problem of multimodal trajectory prediction exploiting a Memory Augmented Neural Network.
Our method learns past and future trajectory embeddings using recurrent neural networks and exploits an associative external memory to store and retrieve such embeddings.
Trajectory prediction is then performed by decoding in-memory future encodings conditioned with the observed past.
arXiv Detail & Related papers (2020-06-05T09:49:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.