TransPimLib: A Library for Efficient Transcendental Functions on
Processing-in-Memory Systems
- URL: http://arxiv.org/abs/2304.01951v5
- Date: Tue, 5 Sep 2023 19:55:48 GMT
- Title: TransPimLib: A Library for Efficient Transcendental Functions on
Processing-in-Memory Systems
- Authors: Maurus Item, Juan Gómez-Luna, Yuxin Guo, Geraldo F. Oliveira,
Mohammad Sadrosadati, Onur Mutlu
- Abstract summary: We present TransPimLib, a library that provides CORDIC-based and LUT-based methods for trigonometric functions, hyperbolic functions, exponentiation, logarithm, square root, etc.
We develop an implementation of TransPimLib for the UPMEM PIM architecture and perform a thorough evaluation of TransPimLib's methods in terms of performance and accuracy.
- Score: 8.440839526313797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Processing-in-memory (PIM) promises to alleviate the data movement bottleneck
in modern computing systems. However, current real-world PIM systems have the
inherent disadvantage that their hardware is more constrained than in
conventional processors (CPU, GPU), due to the difficulty and cost of building
processing elements near or inside the memory. As a result, general-purpose PIM
architectures support fairly limited instruction sets and struggle to execute
complex operations such as transcendental functions and other hard-to-calculate
operations (e.g., square root). These operations are particularly important for
some modern workloads, e.g., activation functions in machine learning
applications.
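As a concrete illustration (ours, not code from the paper): the Sigmoid and Softmax workloads evaluated later both reduce to exponentiation, which a PIM core lacking native transcendental instructions cannot evaluate directly, so the single expf() call below is precisely the kind of operation a library like TransPimLib must supply in software.

```c
#include <math.h>

/* On a CPU, sigmoid is one libm call; a constrained PIM core has no
   expf() instruction or runtime, so this call must be emulated. */
float sigmoid(float x)
{
    return 1.0f / (1.0f + expf(-x));
}

/* Softmax likewise needs one exponentiation per element. */
void softmax(const float *in, float *out, int n)
{
    float max = in[0], sum = 0.0f;
    for (int i = 1; i < n; i++)        /* subtract max for stability */
        if (in[i] > max) max = in[i];
    for (int i = 0; i < n; i++) {
        out[i] = expf(in[i] - max);
        sum += out[i];
    }
    for (int i = 0; i < n; i++)
        out[i] /= sum;
}
```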
In order to provide support for transcendental (and other hard-to-calculate)
functions in general-purpose PIM systems, we present TransPimLib, a
library that provides CORDIC-based and LUT-based methods for trigonometric
functions, hyperbolic functions, exponentiation, logarithm, square root, etc.
We develop an implementation of TransPimLib for the UPMEM PIM architecture and
perform a thorough evaluation of TransPimLib's methods in terms of performance
and accuracy, using microbenchmarks and three full workloads (Blackscholes,
Sigmoid, Softmax). We open-source all our code and datasets
at https://github.com/CMU-SAFARI/transpimlib.
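To make the two families of methods concrete, below are minimal self-contained sketches of each; the constants, names, and parameters are our own illustration, not TransPimLib's actual API. First, a fixed-point CORDIC routine that computes sine and cosine with shifts and adds only, a good fit for PIM cores without hardware multipliers:

```c
#include <stdint.h>
#include <stdio.h>

#define CORDIC_ITERS 16
#define FIXED_SHIFT  16                    /* Q16.16 fixed point */
#define FIXED_ONE    (1 << FIXED_SHIFT)

/* atan(2^-i) in Q16.16, precomputed on the host */
static const int32_t atan_tab[CORDIC_ITERS] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};

#define CORDIC_K 39797   /* 1/prod(sqrt(1 + 2^-2i)) ~= 0.60725 in Q16.16 */

/* Rotation-mode CORDIC: drives the angle z (radians, Q16.16,
   |z| <= pi/2) to zero; x and y converge to cos(z) and sin(z).
   Uses only shifts and adds -- no multiplier needed. */
static void cordic_sincos(int32_t z, int32_t *s, int32_t *c)
{
    int32_t x = CORDIC_K;   /* pre-scaled so the final gain is 1 */
    int32_t y = 0;
    for (int i = 0; i < CORDIC_ITERS; i++) {
        int32_t dx = x >> i, dy = y >> i;
        if (z >= 0) { x -= dy; y += dx; z -= atan_tab[i]; }
        else        { x += dy; y -= dx; z += atan_tab[i]; }
    }
    *c = x;
    *s = y;
}

int main(void)
{
    int32_t s, c;
    cordic_sincos(FIXED_ONE / 2, &s, &c);   /* angle = 0.5 rad */
    printf("sin(0.5) ~= %f, cos(0.5) ~= %f\n",
           (double)s / FIXED_ONE, (double)c / FIXED_ONE);
    return 0;
}
```

Second, a lookup-table method with linear interpolation: the table is filled once on the host, after which each PIM core approximates the function with one table read and one multiply-add:

```c
#include <math.h>

#define LUT_SIZE 256
#define LUT_LO   (-8.0f)
#define LUT_HI   ( 8.0f)

static float lut[LUT_SIZE + 1];   /* filled once on the host CPU */

void lut_init(float (*f)(float))
{
    for (int i = 0; i <= LUT_SIZE; i++)
        lut[i] = f(LUT_LO + (LUT_HI - LUT_LO) * i / LUT_SIZE);
}

/* On the PIM core: clamp, index, linearly interpolate. */
float lut_eval(float x)
{
    if (x <= LUT_LO) return lut[0];
    if (x >= LUT_HI) return lut[LUT_SIZE];
    float t    = (x - LUT_LO) * (LUT_SIZE / (LUT_HI - LUT_LO));
    int   i    = (int)t;
    float frac = t - (float)i;
    return lut[i] + frac * (lut[i + 1] - lut[i]);
}
```

For example, after lut_init(expf), lut_eval(0.5f) approximates e^0.5. Table size trades per-core memory against accuracy, which is exactly the kind of performance/accuracy trade-off the paper's evaluation quantifies.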
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the computational cost of the LLM by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on the optimization algorithm Stochastic Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance.
Processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
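For orientation, the SGD update at the core of such training workloads is computationally simple; a generic sketch (our illustration for a linear model, not PIM-Opt's code) shows why the workload is dominated by streaming data rather than arithmetic:

```c
/* One SGD step for a linear model: w <- w - lr * (w.x - y) * x.
   Two passes over the d-dimensional example -- little compute per
   byte moved, which is the data-movement bottleneck PIM targets. */
void sgd_step(float *w, const float *x, float y, int d, float lr)
{
    float pred = 0.0f;
    for (int j = 0; j < d; j++)
        pred += w[j] * x[j];
    float scale = lr * (pred - y);
    for (int j = 0; j < d; j++)
        w[j] -= scale * x[j];
}
```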
arXiv Detail & Related papers (2024-04-10T17:00:04Z) - Dataflow-Aware PIM-Enabled Manycore Architecture for Deep Learning Workloads [16.67441258454545]
Processing-in-memory (PIM) has emerged as an enabler for the energy-efficient and high-performance acceleration of deep learning (DL) workloads.
Resistive random-access memory (ReRAM) is one of the most promising technologies to implement PIM.
Existing PIM-based architectures mostly focus on computation while ignoring the role of communication.
arXiv Detail & Related papers (2024-03-28T00:29:15Z) - In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - Constant Memory Attention Block [74.38724530521277]
Constant Memory Attention Block (CMAB) is a novel general-purpose attention block that computes its output in constant memory and performs updates in constant computation.
We show our proposed methods achieve results competitive with state-of-the-art while being significantly more memory efficient.
arXiv Detail & Related papers (2023-06-21T22:41:58Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - An Experimental Evaluation of Machine Learning Training on a Real
Processing-in-Memory System [9.429605859159023]
Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound.
Memory-centric computing systems, with processing-in-memory capabilities, can alleviate this data movement bottleneck.
We implement several representative classic ML algorithms on a real-world general-purpose PIM architecture.
arXiv Detail & Related papers (2022-07-16T09:39:53Z) - Machine Learning Training on a Real Processing-in-Memory System [9.286176889576996]
Training machine learning algorithms is a computationally intensive process, which is frequently memory-bound.
Memory-centric computing systems with processing-in-memory capabilities can alleviate this data movement bottleneck.
Our work is the first to evaluate training of machine learning algorithms on a real-world general-purpose PIM architecture.
arXiv Detail & Related papers (2022-06-13T10:20:23Z) - Walle: An End-to-End, General-Purpose, and Large-Scale Production System
for Device-Cloud Collaborative Machine Learning [40.09527159285327]
We build the first end-to-end and general-purpose system, called Walle, for device-cloud collaborative machine learning (ML).
Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment.
We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability.
arXiv Detail & Related papers (2022-05-30T03:43:35Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - DFTpy: An efficient and object-oriented platform for orbital-free DFT
simulations [55.41644538483948]
In this work, we present DFTpy, an open source software implementing OFDFT written entirely in Python 3.
We showcase the electronic structure of a million-atom system of aluminum metal, computed on a single CPU.
DFTpy is released under the MIT license.
arXiv Detail & Related papers (2020-02-07T19:07:41Z)