LookupFFN: Making Transformers Compute-lite for CPU inference
- URL: http://arxiv.org/abs/2403.07221v1
- Date: Tue, 12 Mar 2024 00:26:16 GMT
- Title: LookupFFN: Making Transformers Compute-lite for CPU inference
- Authors: Zhanpeng Zeng, Michael Davies, Pranav Pulijala, Karthikeyan
Sankaralingam, Vikas Singh
- Abstract summary: GPU clusters are the de facto choice for training large deep neural network (DNN) models today.
Several reasons including ease of workflow, security and cost have led to efforts investigating whether CPUs may be viable for inference in routine use in many sectors of the industry.
We study a module which is a workhorse within modern architectures, GEMM based Feed Forward Networks (FFNs) and assess the extent to which it can be made compute- (or FLOP-) lite.
- Score: 23.61144705380663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While GPU clusters are the de facto choice for training large deep neural
network (DNN) models today, several reasons including ease of workflow,
security and cost have led to efforts investigating whether CPUs may be viable
for inference in routine use in many sectors of the industry. But the imbalance
between the compute capabilities of GPUs and CPUs is huge. Motivated by these
considerations, we study a module which is a workhorse within modern DNN
architectures, GEMM based Feed Forward Networks (FFNs), and assess the extent
to which it can be made compute- (or FLOP-) lite. Specifically, we propose an
alternative formulation (we call it LookupFFN) to GEMM based FFNs inspired by
the recent studies of using Locality Sensitive Hashing (LSH) to approximate
FFNs. Our formulation recasts most essential operations as a memory look-up,
leveraging the trade-off between the two resources on any platform: compute and
memory (since CPUs offer it in abundance). For RoBERTa language model
pretraining, our formulation achieves similar performance compared to GEMM
based FFNs, while dramatically reducing the required FLOP. Our development is
complemented with a detailed hardware profiling of strategies that will
maximize efficiency -- not just on contemporary hardware but on products that
will be offered in the near/medium term future. Code is available at
https://github.com/mlpen/LookupFFN.
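To make the "memory look-up instead of GEMM" idea in the abstract concrete, here is a minimal sketch of a generic LSH-style lookup layer. It is not the authors' implementation: the function names (`make_lookup_ffn`, `hash_index`, `lookup_ffn`), the dense random hyperplanes used for hashing, and the randomly initialized tables are all assumptions made for illustration. In a trained model the table rows would be learned, and the hashing step itself would need to be cheap for the FLOP savings to materialize.

```python
import random


def make_lookup_ffn(d_in, d_out, n_tables, n_bits, seed=0):
    """Build hypothetical hashing planes and lookup tables (illustration only)."""
    rng = random.Random(seed)
    # One set of n_bits random hyperplanes per table (sign-based LSH).
    planes = [
        [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(n_bits)]
        for _ in range(n_tables)
    ]
    # Each table holds 2**n_bits rows of size d_out; random here, learned in practice.
    tables = [
        [[rng.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(2 ** n_bits)]
        for _ in range(n_tables)
    ]
    return planes, tables


def hash_index(x, table_planes):
    """Map x to a bucket index by packing the signs of its projections into bits."""
    idx = 0
    for plane in table_planes:
        dot = sum(p * xi for p, xi in zip(plane, x))
        idx = (idx << 1) | (1 if dot >= 0 else 0)
    return idx


def lookup_ffn(x, planes, tables):
    """Approximate an FFN output as the sum of one fetched row per table."""
    d_out = len(tables[0][0])
    out = [0.0] * d_out
    for table_planes, table in zip(planes, tables):
        row = table[hash_index(x, table_planes)]  # memory look-up, no GEMM
        for j in range(d_out):
            out[j] += row[j]
    return out


if __name__ == "__main__":
    planes, tables = make_lookup_ffn(d_in=8, d_out=4, n_tables=3, n_bits=5, seed=1)
    x = [0.5, -1.0, 2.0, 0.1, -0.3, 0.7, 1.5, -2.2]
    print(lookup_ffn(x, planes, tables))
```

In this sketch the output stage costs only `n_tables * d_out` additions per input, while memory use grows as `n_tables * 2**n_bits * d_out` table entries, which is the compute-for-memory trade the abstract describes. The dense projections used here for hashing still cost `n_tables * n_bits * d_in` multiply-adds, so they stand in for whatever cheaper hashing scheme a real system would use.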
Related papers
- Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers [16.253898272659242]
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive.
This has sparked a research agenda to reduce these models' parameter count and computational costs without significantly impacting their performance.
We consider three candidate linear layer approximations in the FFN by combining efficient low-rank and block-diagonal matrices.
arXiv Detail & Related papers (2024-06-24T08:43:21Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
In practice, our method can reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO is capable of identifying the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing framework that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- Real-time Hyper-Dimensional Reconfiguration at the Edge using Hardware Accelerators [12.599871451119538]
HyDRATE can perform real-time reconfiguration at the edge using deep neural nets (DNN) combined with hyperdimensional (HD) computing accelerators.
We describe the algorithm, trained quantized model generation, and simulated performance of a feature extractor free of multiply-accumulates.
We show that reconfigurability in the field is achieved by retraining only the feed-forward HD classifier without gradient descent backpropagation.
arXiv Detail & Related papers (2022-06-10T14:08:41Z)
- Hardware-Efficient Deconvolution-Based GAN for Edge Computing [1.5229257192293197]
Generative Adversarial Networks (GAN) are cutting-edge algorithms for generating new data samples based on the learned data distribution.
We propose an HW/SW co-design approach for training quantized deconvolution GAN (QDCGAN) implemented on FPGA using a scalable streaming dataflow architecture.
Various precisions, datasets, and network scalability were analyzed for low-power inference on resource-constrained platforms.
arXiv Detail & Related papers (2022-01-18T11:16:59Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- StreamBrain: An HPC Framework for Brain-like Neural Networks on CPUs, GPUs and FPGAs [0.0]
StreamBrain is a framework that allows neural networks based on BCPNN to be practically deployed in High-Performance Computing systems.
We empirically demonstrate that StreamBrain can train on the well-known ML benchmark dataset MNIST within seconds.
We are the first to demonstrate BCPNN on STL-10 size networks.
arXiv Detail & Related papers (2021-06-09T20:28:18Z)
- iffDetector: Inference-aware Feature Filtering for Object Detection [70.8678270164057]
We introduce a generic Inference-aware Feature Filtering (IFF) module that can easily be combined with modern detectors.
IFF performs closed-loop optimization by leveraging high-level semantics to enhance the convolutional features.
IFF can be fused with CNN-based object detectors in a plug-and-play manner with negligible computational cost overhead.
arXiv Detail & Related papers (2020-06-23T02:57:29Z)