An Experimental Evaluation of Machine Learning Training on a Real
Processing-in-Memory System
- URL: http://arxiv.org/abs/2207.07886v5
- Date: Tue, 5 Sep 2023 20:04:41 GMT
- Title: An Experimental Evaluation of Machine Learning Training on a Real
Processing-in-Memory System
- Authors: Juan G\'omez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy
Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, Onur Mutlu
- Abstract summary: Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound.
Memory-centric computing systems, with processing-in-memory capabilities, can alleviate this data movement bottleneck.
We implement several representative classic ML algorithms on a real-world general-purpose PIM architecture.
- Score: 9.429605859159023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training machine learning (ML) algorithms is a computationally intensive
process, which is frequently memory-bound due to repeatedly accessing large
training datasets. As a result, processor-centric systems (e.g., CPU, GPU)
suffer from costly data movement between memory units and processing units,
which consumes large amounts of energy and execution cycles. Memory-centric
computing systems, i.e., with processing-in-memory (PIM) capabilities, can
alleviate this data movement bottleneck.
Our goal is to understand the potential of modern general-purpose PIM
architectures to accelerate ML training. To do so, we (1) implement several
representative classic ML algorithms (namely, linear regression, logistic
regression, decision tree, K-Means clustering) on a real-world general-purpose
PIM architecture, (2) rigorously evaluate and characterize them in terms of
accuracy, performance and scaling, and (3) compare to their counterpart
implementations on CPU and GPU. Our evaluation on a real memory-centric
computing system with more than 2500 PIM cores shows that general-purpose PIM
architectures can greatly accelerate memory-bound ML workloads, when the
necessary operations and datatypes are natively supported by PIM hardware. For
example, our PIM implementation of decision tree is $27\times$ faster than a
state-of-the-art CPU version on an 8-core Intel Xeon, and $1.34\times$ faster
than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering
on PIM is $2.8\times$ and $3.2\times$ than state-of-the-art CPU and GPU
versions, respectively.
To our knowledge, our work is the first one to evaluate ML training on a
real-world PIM architecture. We conclude with key observations, takeaways, and
recommendations that can inspire users of ML workloads, programmers of PIM
architectures, and hardware designers & architects of future memory-centric
computing systems.
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference [2.9302211589186244]
Large language models (LLMs) have transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations.
Developments in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore's law.
compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory.
arXiv Detail & Related papers (2024-06-12T16:57:58Z) - PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on the optimization algorithm Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance.
processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
arXiv Detail & Related papers (2024-04-10T17:00:04Z) - PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures [10.047157906258196]
We introduce PyGim, an efficient ML library that accelerates Graph Neural Networks on real PIM systems.
We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed in processor-centric and memory-centric systems.
We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by on average 3.04x.
arXiv Detail & Related papers (2024-02-26T16:52:35Z) - In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimize NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS)
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - Machine Learning Training on a Real Processing-in-Memory System [9.286176889576996]
Training machine learning algorithms is a computationally intensive process, which is frequently memory-bound.
Memory-centric computing systems with processing-in-memory capabilities can alleviate this data movement bottleneck.
Our work is the first one to evaluate training of machine learning algorithms on a real-world general-purpose PIM architecture.
arXiv Detail & Related papers (2022-06-13T10:20:23Z) - Heterogeneous Data-Centric Architectures for Modern Data-Intensive
Applications: Case Studies in Machine Learning and Databases [9.927754948343326]
processing-in-memory (PIM) is a promising execution paradigm that alleviates the data movement bottleneck in modern applications.
In this paper, we show how to take advantage of the PIM paradigm for two modern data-intensive applications.
arXiv Detail & Related papers (2022-05-29T13:43:17Z) - One-step regression and classification with crosspoint resistive memory
arrays [62.997667081978825]
High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge.
One-step learning is supported by simulations of the prediction of the cost of a house in Boston and the training of a 2-layer neural network for MNIST digit recognition.
Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
arXiv Detail & Related papers (2020-05-05T08:00:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.