Attention, Distillation, and Tabularization: Towards Practical Neural
Network-Based Prefetching
- URL: http://arxiv.org/abs/2401.06362v3
- Date: Thu, 22 Feb 2024 04:15:45 GMT
- Title: Attention, Distillation, and Tabularization: Towards Practical Neural
Network-Based Prefetching
- Authors: Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan, Viktor K. Prasanna
- Abstract summary: We propose a new approach that significantly reduces model complexity and inference latency without sacrificing prediction accuracy.
We develop DART, a prefetcher comprised of a simple hierarchy of tables.
DART outperforms state-of-the-art NN-based prefetchers TransFetch by 33.1% and Voyager by 37.2% in terms of IPC improvement.
- Score: 6.692695353937492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based Neural Networks (NN) have demonstrated their effectiveness in
accurate memory access prediction, an essential step in data prefetching.
However, the substantial computational overheads associated with these models
result in high inference latency, limiting their feasibility as practical
prefetchers. To close the gap, we propose a new approach based on
tabularization that significantly reduces model complexity and inference
latency without sacrificing prediction accuracy. Our novel tabularization
methodology takes as input a distilled, yet highly accurate attention-based
model for memory access prediction and efficiently converts its expensive
matrix multiplications into a hierarchy of fast table lookups. As an exemplar
of the above approach, we develop DART, a prefetcher comprised of a simple
hierarchy of tables. With a modest 0.09 drop in F1-score, DART reduces 99.99%
of arithmetic operations from the large attention-based model and 91.83% from
the distilled model. DART accelerates the large model inference by 170x and the
distilled model by 9.4x. DART has latency and storage costs comparable to the
state-of-the-art rule-based prefetcher BO but surpasses it by 6.1% in IPC
improvement. DART outperforms state-of-the-art NN-based prefetchers TransFetch
by 33.1% and Voyager by 37.2% in terms of IPC improvement, primarily due to its
low prefetching latency.
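To make the tabularization idea concrete, here is a minimal, self-contained sketch of replacing a linear layer's matrix multiplication with precomputed table lookups in a product-quantization style. It is only an illustration of the general principle the abstract describes, not DART's specific hierarchy of tables, its attention distillation, or its training pipeline; all names, shapes, and the assumption that prototypes were learned offline (e.g., by k-means over training activations) are illustrative.

```python
import numpy as np

def build_tables(W, prototypes):
    """Precompute lookup tables so that y ~= x @ W can be served by table reads.

    W:          (d_in, d_out) weight matrix of a linear layer
    prototypes: (n_codebooks, n_prototypes, sub_dim) learned offline
                (assumed here, e.g. via k-means over training activations),
                with n_codebooks * sub_dim == d_in
    Returns tables of shape (n_codebooks, n_prototypes, d_out) where
    tables[c, k] = prototypes[c, k] @ W[c*sub_dim : (c+1)*sub_dim].
    """
    n_codebooks, n_prototypes, sub_dim = prototypes.shape
    d_in, d_out = W.shape
    assert n_codebooks * sub_dim == d_in
    tables = np.empty((n_codebooks, n_prototypes, d_out), dtype=W.dtype)
    for c in range(n_codebooks):
        W_slice = W[c * sub_dim:(c + 1) * sub_dim]   # (sub_dim, d_out)
        tables[c] = prototypes[c] @ W_slice          # (n_prototypes, d_out)
    return tables

def table_lookup_matmul(x, prototypes, tables):
    """Approximate x @ W at inference with nearest-prototype search plus table sums."""
    n_codebooks, _, sub_dim = prototypes.shape
    y = np.zeros(tables.shape[-1], dtype=tables.dtype)
    for c in range(n_codebooks):
        sub = x[c * sub_dim:(c + 1) * sub_dim]
        # Cheap encoding: pick the closest prototype for this input slice.
        k = np.argmin(((prototypes[c] - sub) ** 2).sum(axis=1))
        y += tables[c, k]                            # one table read per codebook
    return y
```

Encoding costs one nearest-prototype search per codebook, and the output is a handful of table reads and additions, which is where the large reduction in arithmetic operations reported above would come from; DART additionally organizes such tables into a hierarchy, which this sketch does not attempt to capture.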
Related papers
- Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
A transformer has quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting the distributional shift introduced by sparse attention. Our method maintains approximately 98.5% sparsity relative to full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M-token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance (a minimal sketch of this gating appears after this list).
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - AI-Accelerated Flow Simulation: A Robust Auto-Regressive Framework for Long-Term CFD Forecasting [2.3964255330849356]
We introduce the first implementation of the two-step derivative Adams-Bashforth method specifically tailored for data-driven AR prediction. We develop three novel adaptive weighting strategies that dynamically adjust the importance of different future time steps. Our framework accurately predicts 350 future steps, reducing mean squared error from 0.125 to 0.002.
arXiv Detail & Related papers (2024-12-07T14:02:57Z) - Point Transformer V3: Simpler, Faster, Stronger [88.80496333515325]
This paper focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing.
We present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms.
PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios.
arXiv Detail & Related papers (2023-12-15T18:59:59Z) - Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
Up to 40% of the original FLOP count can be reduced with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z) - Phases, Modalities, Temporal and Spatial Locality: Domain Specific ML
Prefetcher for Accelerating Graph Analytics [7.52191887022819]
We propose MPGraph, an ML-based Prefetcher for Graph analytics using domain specific models.
MPGraph uses three novel optimizations: soft detection for phase transitions, phase-specific multi-modality models for access prediction, and chain spatio-temporal (CST) prefetching.
Using CST, MPGraph achieves 12.52-21.23% IPC improvement, outperforming state-of-the-art non-ML prefetcher BO by 7.5-12.03% and ML-based prefetchers Voyager and TransFetch by 3.27-4.58%.
arXiv Detail & Related papers (2022-12-10T09:14:44Z) - Pushing the Limits of Asynchronous Graph-based Object Detection with
Event Cameras [62.70541164894224]
We introduce several architecture choices which allow us to scale the depth and complexity of such models while maintaining low computation.
Our method runs 3.7 times faster than a dense graph neural network, taking only 8.4 ms per forward pass.
arXiv Detail & Related papers (2022-11-22T15:14:20Z) - Peeling the Onion: Hierarchical Reduction of Data Redundancy for
Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z) - Masked Autoencoders Enable Efficient Knowledge Distillers [31.606287119666572]
This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders.
We minimize the distance between the intermediate feature map of the teacher model and that of the student model (a minimal sketch of this feature-distillation objective appears after this list).
Our method can robustly distill knowledge from teacher models even with extremely high masking ratios.
arXiv Detail & Related papers (2022-08-25T17:58:59Z) - TangoBERT: Reducing Inference Cost by using Cascaded Architecture [9.496399437260678]
We present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model.
The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model.
We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task.
arXiv Detail & Related papers (2022-04-13T09:45:08Z) - Voxelmorph++ Going beyond the cranial vault with keypoint supervision
and multi-channel instance optimisation [8.88841928746097]
The recent Learn2Reg benchmark shows that single-scale U-Net architectures fall short of state-of-the-art performance for abdominal or intra-patient lung registration.
Here, we propose two straightforward steps that greatly reduce this gap in accuracy.
First, we employ keypoint self-supervision with a novel network head that predicts a discretised heatmap.
Second, we replace multiple learned fine-tuning steps with a single instance optimisation step that uses hand-crafted features and the Adam optimiser.
arXiv Detail & Related papers (2022-02-28T19:23:29Z) - Towards Practical Lipreading with Distilled and Efficient Models [57.41253104365274]
Lipreading has witnessed a lot of progress due to the resurgence of neural networks.
Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization.
There is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios.
We propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000, to 88.5% and 46.6% respectively, using self-distillation.
arXiv Detail & Related papers (2020-07-13T16:56:27Z) - DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
We propose a new model architecture DeBERTa that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
arXiv Detail & Related papers (2020-06-05T19:54:34Z)
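For the gated-attention entry above ("Gated Attention for Large Language Models"), the sketch below shows one way a head-specific sigmoid gate can be applied after scaled dot-product attention. Computing the gate from the query and using a per-head linear projection are assumptions made for illustration; the paper's exact gate input and parameterization may differ.

```python
import torch
import torch.nn.functional as F

def gated_sdpa(q, k, v, gate_weight, gate_bias):
    """Scaled dot-product attention followed by a head-specific sigmoid gate.

    q, k, v:     (batch, heads, seq, head_dim)
    gate_weight: (heads, head_dim, head_dim)  # per-head gate projection (assumed)
    gate_bias:   (heads, 1, head_dim)
    """
    out = F.scaled_dot_product_attention(q, k, v)  # standard SDPA
    # Head-specific gate: sigmoid of a per-head projection of the query,
    # applied elementwise to the attention output.
    gate = torch.sigmoid(torch.einsum("bhsd,hde->bhse", q, gate_weight) + gate_bias)
    return out * gate
```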
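Similarly, for the "Masked Autoencoders Enable Efficient Knowledge Distillers" entry, the sketch below shows a generic feature-map distillation objective: minimize the distance between an intermediate feature map of a frozen teacher and that of a student. The linear projection that aligns widths and the plain L2 distance are assumptions; the paper's exact loss, layer choice, and masking schedule are not reproduced here. A distillation objective of this general kind is also the sort of step that produces the distilled attention model DART then tabularizes.

```python
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(student_feat, teacher_feat, proj):
    """L2 distance between intermediate feature maps of student and (frozen) teacher.

    student_feat: (batch, seq, d_student)
    teacher_feat: (batch, seq, d_teacher), taken from the teacher with no gradient
    proj:         nn.Linear(d_student, d_teacher), aligns widths when they differ
    """
    return F.mse_loss(proj(student_feat), teacher_feat.detach())

# Illustrative wiring (names are assumptions):
#   proj = nn.Linear(256, 1024)
#   loss = task_loss + kd_weight * feature_distillation_loss(s_feat, t_feat, proj)
```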