SimplePIM: A Software Framework for Productive and Efficient
Processing-in-Memory
- URL: http://arxiv.org/abs/2310.01893v1
- Date: Tue, 3 Oct 2023 08:59:39 GMT
- Authors: Jinfan Chen, Juan Gómez-Luna, Izzat El Hajj, Yuxin Guo, Onur Mutlu
- Abstract summary: The processing-in-memory (PIM) paradigm aims to alleviate the data movement bottleneck between memory and processors by performing computation inside memory chips.
This paper presents a new software framework, SimplePIM, to aid programming real PIM systems.
We implement SimplePIM for the UPMEM PIM system and evaluate it on six major applications.
- Score: 8.844860045305772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data movement between memory and processors is a major bottleneck in modern
computing systems. The processing-in-memory (PIM) paradigm aims to alleviate
this bottleneck by performing computation inside memory chips. Real PIM
hardware (e.g., the UPMEM system) is now available and has demonstrated
potential in many applications. However, programming such real PIM hardware
remains a challenge for many programmers.
This paper presents a new software framework, SimplePIM, to aid programming
real PIM systems. The framework processes arrays of arbitrary elements on a PIM
device by calling iterator functions from the host and provides primitives for
communication among PIM cores and between PIM and the host system. We implement
SimplePIM for the UPMEM PIM system and evaluate it on six major applications.
Our results show that SimplePIM enables a 66.5% to 83.1% reduction in lines of
code in PIM programs. The resulting code leads to higher performance (between
10% and 37% speedup) than hand-optimized code in three applications and
provides comparable performance in three others. SimplePIM is fully and freely
available at https://github.com/CMU-SAFARI/SimplePIM.
Related papers
- MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models [72.61076288351201]
We propose Memory-efficient Offloaded Mini-sequence Inference (MOM)
MOM partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading.
On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU.
arXiv Detail & Related papers (2025-04-16T23:15:09Z) - Reinforcement Learning for Long-Horizon Interactive LLM Agents [56.9860859585028]
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests.
We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments.
We derive LOOP, a data- and memory-efficient variant of proximal policy optimization.
arXiv Detail & Related papers (2025-02-03T18:35:42Z) - LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System [6.21613161960432]
Large language models (LLMs) process sequences of tens of thousands of tokens.
Processing-in-memory (PIM) maximizes memory bandwidth by moving compute to the data.
LoL-PIM is a multi-node PIM architecture that accelerates long context LLM through hardware-software co-design.
arXiv Detail & Related papers (2024-12-28T14:38:16Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Dynamic neural network with memristive CIM and CAM for 2D and 3D vision [57.6208980140268]
We propose a semantic memory-based dynamic neural network (DNN) using memristors.
The network associates incoming data with the past experience stored as semantic vectors.
We validate our co-designs, using a 40nm memristor macro, on ResNet and PointNet++ for classifying images and 3D points from the MNIST and ModelNet datasets.
arXiv Detail & Related papers (2024-07-12T04:55:57Z) - PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on Stochastic Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance.
However, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
arXiv Detail & Related papers (2024-04-10T17:00:04Z) - PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures [10.047157906258196]
We introduce PyGim, an efficient ML library that accelerates Graph Neural Networks on real PIM systems.
We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed on processor-centric and memory-centric systems, respectively.
We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by an average of 3.04x.
arXiv Detail & Related papers (2024-02-26T16:52:35Z) - ProactivePIM: Accelerating Weight-Sharing Embedding Layer with PIM for Scalable Recommendation System [16.2798383044926]
Weight-sharing algorithms have been proposed for size reduction, but they increase memory access.
Recent advancements in processing-in-memory (PIM) enhanced the model throughput by exploiting memory parallelism.
We propose ProactivePIM, a PIM system for weight-sharing recommendation system acceleration.
arXiv Detail & Related papers (2024-02-06T14:26:22Z) - EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z) - UniPT: Universal Parallel Tuning for Transfer Learning with Efficient
Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains.
Recent PETL works focus on the more valuable memory-efficient characteristic.
We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT)
arXiv Detail & Related papers (2023-08-28T05:38:43Z) - TransPimLib: A Library for Efficient Transcendental Functions on
Processing-in-Memory Systems [8.440839526313797]
We present TransPimLib, a library that provides CORDIC-based and LUT-based methods for trigonometric functions, hyperbolic functions, exponentiation, logarithm, square root, etc.
We develop an implementation of TransPimLib for the UPMEM PIM architecture and perform a thorough evaluation of TransPimLib's methods in terms of performance and accuracy.
arXiv Detail & Related papers (2023-04-03T12:41:46Z) - An Experimental Evaluation of Machine Learning Training on a Real
Processing-in-Memory System [9.429605859159023]
Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound.
Memory-centric computing systems, with processing-in-memory capabilities, can alleviate this data movement bottleneck.
We implement several representative classic ML algorithms on a real-world general-purpose PIM architecture.
arXiv Detail & Related papers (2022-07-16T09:39:53Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS)
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - UNeXt: MLP-based Rapid Medical Image Segmentation Network [80.16644725886968]
UNet and its latest extensions like TransUNet have been the leading medical image segmentation methods in recent years.
We propose UNeXt, a convolutional multilayer perceptron (MLP) based network for image segmentation.
We show that we reduce the number of parameters by 72x, decrease the computational complexity by 68x, and improve the inference speed by 10x while also obtaining better segmentation performance.
arXiv Detail & Related papers (2022-03-09T18:58:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.