PaCKD: Pattern-Clustered Knowledge Distillation for Compressing Memory
Access Prediction Models
- URL: http://arxiv.org/abs/2402.13441v1
- Date: Wed, 21 Feb 2024 00:24:34 GMT
- Title: PaCKD: Pattern-Clustered Knowledge Distillation for Compressing Memory
Access Prediction Models
- Authors: Neelesh Gupta, Pengmiao Zhang, Rajgopal Kannan and Viktor Prasanna
- Abstract summary: PaCKD is a Pattern-Clustered Knowledge Distillation approach to compress MAP models.
PaCKD yields an 8.70% higher F1-score than student models trained with standard knowledge distillation and an 8.88% higher F1-score than student models trained without any form of knowledge distillation.
- Score: 2.404163279345609
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep neural networks (DNNs) have proven to be effective models for accurate
Memory Access Prediction (MAP), a critical task in mitigating memory latency
through data prefetching. However, existing DNN-based MAP models suffer from
challenges such as large physical storage requirements and poor inference
latency, primarily due to their large number of parameters. These limitations
render them impractical for deployment in real-world scenarios. In this paper,
we propose PaCKD, a Pattern-Clustered Knowledge Distillation approach to
compress MAP models while maintaining the prediction performance. The PaCKD
approach encompasses three steps: clustering memory access sequences into
distinct partitions involving similar patterns, training large pattern-specific
teacher models for memory access prediction for each partition, and training a
single lightweight student model by distilling the knowledge from the trained
pattern-specific teachers. We evaluate our approach on LSTM, MLP-Mixer, and
ResNet models, as they exhibit diverse structures and are widely used for image
classification tasks, in order to test their effectiveness on four widely used
graph applications. Compared to the teacher models with 5.406M parameters and
an F1-score of 0.4626, our student models achieve a 552$\times$ model size
compression while maintaining an F1-score of 0.4538 (with a 1.92% performance
drop). Our approach yields an 8.70% higher F1-score than student models
trained with standard knowledge distillation and an 8.88% higher F1-score than
student models trained without any form of knowledge distillation.
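As a rough illustration of the three-step pipeline described in the abstract, the sketch below assumes k-means clustering over memory-access delta features, a multi-label delta-prediction output, and a Hinton-style soft-label distillation loss; the cluster count, temperature, loss weighting, and helper names are placeholders rather than the authors' released code.

```python
# A minimal sketch of the three PaCKD steps: (1) cluster access sequences,
# (2) train a pattern-specific teacher per cluster, (3) distil all teachers
# into one lightweight student. All hyperparameters are assumed values.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

K = 3        # number of access-pattern clusters / teachers (assumed)
T = 4.0      # distillation temperature (assumed)
ALPHA = 0.5  # balance between hard-label loss and distillation loss (assumed)


def cluster_sequences(delta_features: torch.Tensor) -> torch.Tensor:
    """Step 1: partition memory-access sequences into K pattern clusters."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0)
    return torch.as_tensor(km.fit_predict(delta_features.cpu().numpy()))


def train_teachers(make_teacher, loaders_per_cluster):
    """Step 2: train one large, pattern-specific teacher per cluster."""
    teachers = []
    for loader in loaders_per_cluster:
        teacher = make_teacher()
        # ... standard supervised training of `teacher` on `loader` elided ...
        teachers.append(teacher.eval())
    return teachers


def packd_loss(student_logits, teacher_logit_list, targets):
    """Step 3 loss: hard labels plus distillation from averaged teacher soft labels."""
    hard = F.binary_cross_entropy_with_logits(student_logits, targets)
    soft_teacher = torch.stack(
        [torch.sigmoid(t / T) for t in teacher_logit_list]).mean(dim=0)
    soft_student = torch.sigmoid(student_logits / T)
    distill = F.binary_cross_entropy(soft_student, soft_teacher) * (T * T)
    return ALPHA * hard + (1.0 - ALPHA) * distill


def train_student(student, teachers, loader, epochs=10, lr=1e-3):
    """Step 3: distil all pattern-specific teachers into one lightweight student."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                teacher_logits = [t(x) for t in teachers]
            loss = packd_loss(student(x), teacher_logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

Averaging the teachers' soft labels is only one plausible way to aggregate the pattern-specific knowledge; the paper's exact distillation objective may weight the teachers differently.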
Related papers
- MIRACLE 3D: Memory-efficient Integrated Robust Approach for Continual Learning on Point Clouds via Shape Model construction [0.4604003661048266]
We introduce a novel framework for memory-efficient and privacy-preserving continual learning in 3D object classification.
Our method constructs a compact shape model for each class, retaining only the mean shape along with a few key modes of variation.
We validate our approach through extensive experiments on the ModelNet40, ShapeNet, and ScanNet datasets.
arXiv Detail & Related papers (2024-10-08T23:12:33Z)
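The "mean shape plus a few key modes of variation" idea can be sketched with a PCA-style shape model over corresponded point clouds; this is a generic stand-in under that assumption, not the paper's actual construction, and the function names are illustrative.

```python
# Hedged sketch: a compact per-class shape model that keeps only the class mean
# and a few principal modes of variation (PCA used as a generic stand-in).
import numpy as np
from sklearn.decomposition import PCA

def build_shape_model(point_clouds: np.ndarray, n_modes: int = 4):
    """point_clouds: [n_samples, n_points * 3], flattened shapes in correspondence."""
    mean_shape = point_clouds.mean(axis=0)
    pca = PCA(n_components=n_modes)
    pca.fit(point_clouds - mean_shape)
    return mean_shape, pca.components_, pca.explained_variance_

def sample_shape(mean_shape, modes, variances, rng=None):
    """Synthesize a plausible class member from the compact model."""
    rng = rng or np.random.default_rng()
    coeffs = rng.normal(0.0, np.sqrt(variances))  # one coefficient per mode
    return mean_shape + coeffs @ modes
```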
arXiv Detail & Related papers (2024-10-08T23:12:33Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
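For context, the standard soft-label distillation objective that such student-teacher setups build on (OKD's online-module variant is not reproduced here) is commonly written as
$$\mathcal{L}_{\mathrm{KD}} = \alpha\,\mathcal{L}_{\mathrm{CE}}\big(y, \sigma(z_s)\big) + (1-\alpha)\,T^{2}\,\mathrm{KL}\big(\sigma(z_t/T)\,\|\,\sigma(z_s/T)\big),$$
where $z_s$ and $z_t$ are the student and teacher logits, $\sigma$ is the softmax, $T$ is the distillation temperature, and $\alpha$ balances the hard-label and soft-label terms.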
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - A Lightweight Measure of Classification Difficulty from Application Dataset Characteristics [4.220363193932374]
We propose an efficient cosine similarity-based classification difficulty measure S.
It is calculated from the number of classes and intra- and inter-class similarity metrics of the dataset.
We show how a practitioner can use this measure to help select an efficient model 6 to 29x faster than through repeated training and testing.
arXiv Detail & Related papers (2024-04-09T03:27:09Z)
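The exact definition of S is not given in this summary; the sketch below is only one plausible way to combine the class count with intra- and inter-class cosine similarities into a difficulty proxy, and it should not be read as the paper's formula.

```python
# Hedged sketch: an illustrative cosine-similarity-based difficulty proxy built
# from the number of classes and intra-/inter-class similarity statistics.
import numpy as np

def difficulty_score(features: np.ndarray, labels: np.ndarray) -> float:
    """Higher when classes overlap more and are internally more spread out."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    classes = np.unique(labels)
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    # Mean cosine similarity of samples to their own class centroid.
    intra = float(np.mean([(feats[labels == c] @ centroids[i]).mean()
                           for i, c in enumerate(classes)]))
    # Mean cosine similarity between distinct class centroids.
    sims = centroids @ centroids.T
    n = len(classes)
    inter = float((sims.sum() - np.trace(sims)) / (n * (n - 1)))
    # One possible aggregation: more overlap and more classes -> harder task.
    return inter / max(intra, 1e-8) * float(np.log(n))
```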
arXiv Detail & Related papers (2024-04-09T03:27:09Z) - Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: first, a training set of architecture-accuracy pairs is established, and then a non-neural model is trained as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z)
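A hedged sketch of that two-stage idea, using a gradient-boosted regressor and a random placeholder encoding of candidate architectures (the real feature design and accuracy collection belong to the paper and are not shown here):

```python
# Sketch: fit a non-neural accuracy predictor on (architecture, accuracy) pairs,
# then rank unseen pruning configurations by predicted accuracy.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train = rng.random((200, 12))   # placeholder per-layer pruning-ratio features
y_train = rng.random(200)         # placeholder measured accuracies of those candidates

predictor = GradientBoostingRegressor(n_estimators=300, max_depth=3)
predictor.fit(X_train, y_train)

candidates = rng.random((1000, 12))                      # unseen pruning configurations
best = candidates[np.argmax(predictor.predict(candidates))]
```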
arXiv Detail & Related papers (2023-09-18T06:38:24Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
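As a minimal illustration of parameter-space merging, element-wise averaging of same-architecture checkpoints is shown below; the paper's actual dataless fusion rule is more refined and is not reproduced here.

```python
# Minimal sketch: merge fine-tuned models of identical architecture by averaging
# their parameters, with no training data involved.
import torch

def average_state_dicts(state_dicts):
    """Element-wise average of same-architecture model weights."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Usage: fused.load_state_dict(average_state_dicts([m1.state_dict(), m2.state_dict()]))
```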
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - Knowledge Distillation with Representative Teacher Keys Based on
Attention Mechanism for Image Classification Model Compression [1.503974529275767]
Knowledge distillation (KD) has been recognized as one of the effective methods of model compression for reducing the number of model parameters.
Inspired by the attention mechanism, we propose a novel KD method called representative teacher key (RTK).
Our proposed RTK can effectively improve the classification accuracy of the state-of-the-art attention-based KD method.
arXiv Detail & Related papers (2022-06-26T05:08:50Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Multiple Run Ensemble Learning withLow-Dimensional Knowledge Graph
Embeddings [4.317340121054659]
We propose a simple but effective performance boosting strategy for knowledge graph embedding (KGE) models.
We repeat the training of a model 6 times in parallel with an embedding size of 200 and then combine the 6 separate models for testing.
We show that our approach enables different models to better cope with their issues in modeling various graph patterns.
arXiv Detail & Related papers (2021-04-11T12:26:50Z)
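A hedged sketch of combining the parallel runs at test time by averaging their triple scores; the `score` method is a hypothetical interface, and averaging is only one plausible combination rule.

```python
# Sketch: ensemble several independently trained KGE models of the same type by
# averaging their plausibility scores for (head, relation, tail) triples.
import torch

def ensemble_scores(models, heads, relations, tails):
    """models: list of trained KGE models exposing a hypothetical .score() method."""
    with torch.no_grad():
        per_run = [m.score(heads, relations, tails) for m in models]
    return torch.stack(per_run).mean(dim=0)
```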
arXiv Detail & Related papers (2021-04-11T12:26:50Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the whole distillation process.
Most of the existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
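As a hedged illustration of example-dependent (rather than equal) teacher weighting, soft targets could be formed by down-weighting teachers that fit an example poorly; the paper itself learns this selection with reinforcement learning, which is not reproduced here.

```python
# Illustrative sketch: weight each teacher per example via a softmax over its
# negated per-example loss, then blend the teachers' predictions accordingly.
import torch
import torch.nn.functional as F

def weighted_multi_teacher_targets(teacher_logits, labels):
    """teacher_logits: list of [batch, classes] tensors; labels: [batch] int tensor."""
    losses = torch.stack([F.cross_entropy(t, labels, reduction="none")
                          for t in teacher_logits])           # [n_teachers, batch]
    weights = F.softmax(-losses, dim=0).unsqueeze(-1)          # better fit -> larger weight
    probs = torch.stack([F.softmax(t, dim=-1) for t in teacher_logits])
    return (weights * probs).sum(dim=0)                        # [batch, classes] soft targets
```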
arXiv Detail & Related papers (2020-12-11T08:56:39Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)