LookupFFN: Making Transformers Compute-lite for CPU inference
- URL: http://arxiv.org/abs/2403.07221v1
- Date: Tue, 12 Mar 2024 00:26:16 GMT
- Title: LookupFFN: Making Transformers Compute-lite for CPU inference
- Authors: Zhanpeng Zeng, Michael Davies, Pranav Pulijala, Karthikeyan
Sankaralingam, Vikas Singh
- Abstract summary: GPU clusters are the de facto choice for training large deep neural network (DNN) models today.
Several reasons including ease of workflow, security and cost have led to efforts investigating whether CPUs may be viable for inference in routine use in many sectors of the industry.
We study a module which is a workhorse within modern architectures, GEMM based Feed Forward Networks (FFNs), and assess the extent to which it can be made compute- (or FLOP-) lite.
- Score: 23.61144705380663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While GPU clusters are the de facto choice for training large deep neural
network (DNN) models today, several reasons including ease of workflow,
security and cost have led to efforts investigating whether CPUs may be viable
for inference in routine use in many sectors of the industry. But the imbalance
between the compute capabilities of GPUs and CPUs is huge. Motivated by these
considerations, we study a module which is a workhorse within modern DNN
architectures, GEMM based Feed Forward Networks (FFNs), and assess the extent
to which it can be made compute- (or FLOP-) lite. Specifically, we propose an
alternative formulation (we call it LookupFFN) to GEMM based FFNs inspired by
the recent studies of using Locality Sensitive Hashing (LSH) to approximate
FFNs. Our formulation recasts most essential operations as a memory look-up,
leveraging the trade-off between the two resources on any platform: compute and
memory (since CPUs offer the latter in abundance). For RoBERTa language model
pretraining, our formulation achieves similar performance compared to GEMM
based FFNs, while dramatically reducing the required FLOP. Our development is
complemented with a detailed hardware profiling of strategies that will
maximize efficiency -- not just on contemporary hardware but on products that
will be offered in the near/medium-term future. Code is available at
\url{https://github.com/mlpen/LookupFFN}.
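Since the abstract only describes the idea at a high level, the following is a minimal, hedged NumPy sketch of how an LSH-style lookup could stand in for the GEMMs of an FFN. The sign-random-projection hash, the class name, and all parameters below are illustrative assumptions rather than the paper's actual (differentiable) hashing scheme; the real implementation is in the linked repository.

```python
import numpy as np

# Minimal illustration of a lookup-style FFN, NOT the paper's exact algorithm.
# Assumption: each of several hash tables maps an input vector to a bucket id
# via sign random projections; the layer output is the sum of the gathered
# table entries, replacing the two dense GEMMs of a standard FFN.

class LookupSketchFFN:
    def __init__(self, d_model, n_tables=8, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # Random projections defining n_bits sign hashes per table.
        self.proj = rng.standard_normal((n_tables, d_model, n_bits))
        # One output vector per (table, bucket); learnable in a real model.
        self.tables = 0.02 * rng.standard_normal((n_tables, 2 ** n_bits, d_model))
        self.powers = 2 ** np.arange(n_bits)

    def __call__(self, x):
        # x: (batch, d_model). Hash each row to a bucket per table, then
        # gather and sum table entries -- memory look-ups instead of GEMMs.
        out = np.zeros_like(x)
        for t in range(self.proj.shape[0]):
            bits = (x @ self.proj[t]) > 0        # (batch, n_bits) sign bits
            buckets = bits @ self.powers         # (batch,) integer bucket ids
            out += self.tables[t, buckets]       # FLOP-lite gather
        return out

x = np.random.randn(4, 64)
y = LookupSketchFFN(d_model=64)(x)
print(y.shape)  # (4, 64)
```

In this toy version the only matrix products are the small hash projections; the bulk of the work becomes the table gather, which is the compute-for-memory trade the abstract describes.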
Related papers
- Enhancing MOTION2NX for Efficient, Scalable and Secure Image Inference using Convolutional Neural Networks [4.407841002228536]
We use the ABY2.0 SMPC protocol implemented on the C++-based MOTION2NX framework for a secure convolutional neural network (CNN) inference application with semi-honest security.
We also present a novel splitting algorithm that divides the computations at each CNN layer into multiple chunks.
arXiv Detail & Related papers (2024-08-29T09:50:21Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock time speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems [61.335229621081346]
Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge.
In this paper, we propose FLEdge, which complements existing FL benchmarks by enabling a systematic evaluation of client capabilities.
arXiv Detail & Related papers (2023-06-08T13:11:20Z)
- Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO identifies the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- Real-time Hyper-Dimensional Reconfiguration at the Edge using Hardware Accelerators [12.599871451119538]
HyDRATE can perform real-time reconfiguration at the edge using deep neural nets (DNN) combined with hyperdimensional (HD) computing accelerators.
We describe the algorithm, trained quantized model generation, and simulated performance of a feature extractor free of multiply-accumulates.
We show that reconfigurability in the field is achieved by retraining only the feed-forward HD classifier, without gradient-descent backpropagation.
arXiv Detail & Related papers (2022-06-10T14:08:41Z)
- Hardware-Efficient Deconvolution-Based GAN for Edge Computing [1.5229257192293197]
Generative Adversarial Networks (GAN) are cutting-edge algorithms for generating new data samples based on the learned data distribution.
We propose an HW/SW co-design approach for training a quantized deconvolution GAN (QDCGAN) implemented on an FPGA using a scalable streaming dataflow architecture.
Various precisions, datasets, and network scalability were analyzed for low-power inference on resource-constrained platforms.
arXiv Detail & Related papers (2022-01-18T11:16:59Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point, partition point, and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks (a brief sketch of this decomposition appears after this list).
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- StreamBrain: An HPC Framework for Brain-like Neural Networks on CPUs, GPUs and FPGAs [0.0]
StreamBrain is a framework that allows neural networks based on the Bayesian Confidence Propagation Neural Network (BCPNN) to be practically deployed in High-Performance Computing systems.
We empirically demonstrate that StreamBrain can train on the well-known ML benchmark dataset MNIST within seconds.
We are the first to demonstrate BCPNN on STL-10-sized networks.
arXiv Detail & Related papers (2021-06-09T20:28:18Z)
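As referenced in the {-1, +1} encoding-decomposition entry above, here is a small, hedged NumPy sketch of the general idea behind that paper: a weight matrix quantized to odd integer levels can be split into several binary {-1, +1} branches whose power-of-two weighted sum reproduces the original GEMM. The level set, helper name, and sizes are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

# Hedged sketch: decompose odd-level quantized weights into {-1, +1} branches.
# Assumption: weights take the odd levels {-(2^M - 1), ..., -1, +1, ..., 2^M - 1},
# which are exactly the values expressible as sum_i 2^i * b_i with b_i in {-1, +1}.

def decompose_pm1(w_int, n_bits):
    """Split an odd-integer weight matrix into n_bits binary {-1, +1} matrices."""
    u = (w_int + (2 ** n_bits - 1)) // 2        # shift to {0, ..., 2^M - 1}
    branches = []
    for i in range(n_bits):
        bit = (u >> i) & 1                      # i-th bit, in {0, 1}
        branches.append(2 * bit - 1)            # map to {-1, +1}
    return branches                             # w_int == sum_i 2^i * branches[i]

rng = np.random.default_rng(0)
n_bits = 3
levels = np.arange(-(2 ** n_bits - 1), 2 ** n_bits, 2)   # odd levels only
w = rng.choice(levels, size=(16, 16))

branches = decompose_pm1(w, n_bits)
x = rng.standard_normal((4, 16))
# Each branch GEMM uses only {-1, +1} weights (XNOR/popcount-friendly on hardware);
# their power-of-two weighted sum matches the original quantized GEMM exactly.
y_multi_branch = sum((2 ** i) * (x @ b) for i, b in enumerate(branches))
assert np.allclose(y_multi_branch, x @ w)
```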