HEAM : Hashed Embedding Acceleration using Processing-In-Memory
- URL: http://arxiv.org/abs/2402.04032v3
- Date: Thu, 14 Mar 2024 09:29:12 GMT
- Title: HEAM : Hashed Embedding Acceleration using Processing-In-Memory
- Authors: Youngsuk Kim, Hyuk-Jae Lee, Chae Eun Rhee
- Abstract summary: In today's data centers, personalized recommendation systems face challenges such as the need for large memory capacity and high bandwidth.
Previous approaches have relied on DIMM-based near-memory processing techniques or introduced 3D-stacked DRAM to address memory-bound issues.
This paper introduces HEAM, a heterogeneous memory architecture that integrates 3D-stacked DRAM with DIMM to accelerate recommendation systems.
- Score: 17.66751227197112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In today's data centers, personalized recommendation systems face challenges such as the need for large memory capacity and high bandwidth, especially when performing embedding operations. Previous approaches have relied on DIMM-based near-memory processing techniques or introduced 3D-stacked DRAM to address memory-bound issues and expand memory bandwidth. However, these solutions fall short when dealing with the expanding size of personalized recommendation systems. Recommendation models have grown to sizes exceeding tens of terabytes, making them challenging to run efficiently on traditional single-node inference servers. Although various algorithmic methods have been proposed to reduce embedding table capacity, they often result in increased memory access or inefficient utilization of memory resources. This paper introduces HEAM, a heterogeneous memory architecture that integrates 3D-stacked DRAM with DIMM to accelerate recommendation systems in which compositional embedding is utilized, a technique aimed at reducing the size of embedding tables. The architecture is organized into a three-tier memory hierarchy consisting of conventional DIMM, 3D-stacked DRAM with a base die-level Processing-In-Memory (PIM), and a bank group-level PIM incorporating lookup tables. This setup is specifically designed to accommodate the unique aspects of compositional embedding, such as temporal locality and embedding table capacity. This design effectively reduces bank access, improves access efficiency, and enhances overall throughput, resulting in a 6.3 times speedup and 58.9% energy savings compared to the baseline.
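The abstract does not say which form of compositional embedding HEAM targets. As a minimal sketch, assuming the widely used quotient-remainder variant, the following shows how one large embedding table can be replaced by two small ones. The class name `QREmbedding`, the table sizes, and the element-wise product used to combine the two partial vectors are all illustrative assumptions, not details taken from the paper.

```python
import random


def make_table(rows, dim, rng):
    """Return a rows x dim table of small random floats."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


class QREmbedding:
    """Quotient-remainder compositional embedding (illustrative only)."""

    def __init__(self, num_categories, num_buckets, dim, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        # ceil(num_categories / num_buckets) rows cover every possible quotient.
        num_quotients = -(-num_categories // num_buckets)
        self.q_table = make_table(num_quotients, dim, rng)
        self.r_table = make_table(num_buckets, dim, rng)

    def lookup(self, index):
        # Each category index maps to a unique (quotient, remainder) pair.
        q, r = divmod(index, self.num_buckets)
        # Two table reads per lookup; the element-wise product combines them.
        return [a * b for a, b in zip(self.q_table[q], self.r_table[r])]

    def num_rows(self):
        return len(self.q_table) + len(self.r_table)


emb = QREmbedding(num_categories=1_000_000, num_buckets=1_000, dim=4)
vec = emb.lookup(123_456)  # reads q_table[123] and r_table[456]
print(emb.num_rows())      # 2000 rows instead of 1,000,000
```

In this sketch every lookup reads two rows instead of one, which illustrates the trade-off the abstract describes: a smaller table footprint at the cost of more memory accesses. Because the two small tables are shared by all categories, their rows are reused far more often than rows of a full-size table would be.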
Related papers
- HH-PIM: Dynamic Optimization of Power and Performance with Heterogeneous-Hybrid PIM for Edge AI Devices [1.8749305679160366]
This study introduces a Heterogeneous-Hybrid PIM (HH-PIM) architecture, comprising high-performance MRAM-SRAM PIM modules and low-power MRAM-SRAM PIM modules.
We show that the proposed HH-PIM achieves up to 60.43% average energy savings over conventional PIMs while meeting application requirements.
arXiv Detail & Related papers (2025-04-02T08:22:32Z) - Enabling Low-Cost Secure Computing on Untrusted In-Memory Architectures [5.565715369147691]
Processing-in-Memory (PIM) promises to substantially improve performance by moving processing closer to the data.
Integrating PIM modules within a secure computing system raises an interesting challenge: unencrypted data has to move off-chip to the PIM, exposing the data to attackers and breaking assumptions on Trusted Computing Bases (TCBs).
This paper leverages multi-party computation (MPC) techniques, specifically arithmetic secret sharing and Yao's garbled circuits, to outsource bandwidth-intensive computation securely to PIM.
arXiv Detail & Related papers (2025-01-28T20:48:14Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - BoA: Attention-aware Post-training Quantization without Backpropagation [11.096116957844014]
Post-training quantization (PTQ) is a promising solution for deploying large language models (LLMs) on resource-constrained devices.
We introduce a novel backpropagation-free PTQ algorithm that optimizes integer weights by considering inter-layer dependencies.
arXiv Detail & Related papers (2024-06-19T11:53:21Z) - CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures [0.1747623282473278]
We present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures.
We integrate CLSA-CIM with existing weight-mapping strategies and compare performance against state-of-the-art (SOTA) scheduling algorithms.
arXiv Detail & Related papers (2024-01-15T13:35:21Z) - EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z) - SimplePIM: A Software Framework for Productive and Efficient Processing-in-Memory [8.844860045305772]
The processing-in-memory (PIM) paradigm aims to alleviate this bottleneck by performing computation inside memory chips.
This paper presents a new software framework, SimplePIM, to aid programming real PIM systems.
We implement SimplePIM for the UPMEM PIM system and evaluate it on six major applications.
arXiv Detail & Related papers (2023-10-03T08:59:39Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - Minimizing Age of Information for Mobile Edge Computing Systems: A Nested Index Approach [11.998034941401814]
Mobile edge computation (MEC) provides an efficient approach to achieving real-time applications that are sensitive to information freshness.
In this paper, we consider multiple users offloading tasks to heterogeneous edge servers in a MEC system.
Our algorithm leads to an optimality gap reduction of up to 40%, compared to benchmarks.
arXiv Detail & Related papers (2023-07-03T21:47:21Z) - Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z) - Distributed Reinforcement Learning for Privacy-Preserving Dynamic Edge Caching [91.50631418179331]
A privacy-preserving distributed deep policy gradient (P2D3PG) is proposed to maximize the cache hit rates of devices in the MEC networks.
We convert the distributed optimizations into model-free Markov decision process problems and then introduce a privacy-preserving federated learning method for popularity prediction.
arXiv Detail & Related papers (2021-10-20T02:48:27Z) - Continual Learning Approach for Improving the Data and Computation Mapping in Near-Memory Processing System [3.202860612193139]
We propose an artificially intelligent memory mapping scheme, AIMM, that optimizes data placement and resource utilization through page and computation remapping.
AIMM uses a neural network to achieve a near-optimal mapping during execution, trained using a reinforcement learning algorithm.
Our experimental evaluation shows that AIMM improves the baseline NMP performance in single- and multiple-program scenarios by up to 70% and 50%, respectively.
arXiv Detail & Related papers (2021-04-28T09:50:35Z) - Covert Model Poisoning Against Federated Learning: Algorithm Design and Optimization [76.51980153902774]
Federated learning (FL) is vulnerable to external attacks on FL models during parameters transmissions.
In this paper, we propose effective MP algorithms to combat state-of-the-art defensive aggregation mechanisms.
Our experimental results demonstrate that the proposed CMP algorithms are effective and substantially outperform existing attack mechanisms.
arXiv Detail & Related papers (2021-01-28T03:28:18Z) - Reinforcement Learning Based Cooperative Coded Caching under Dynamic Popularities in Ultra-Dense Networks [38.44125997148742]
Caching strategy at small base stations (SBSs) is critical to meet massive high data rate requests.
We exploit reinforcement learning (RL) to design a cooperative caching strategy with maximum-distance separable (MDS) coding.
arXiv Detail & Related papers (2020-03-08T10:45:45Z)