RecSSD: Near Data Processing for Solid State Drive Based Recommendation
Inference
- URL: http://arxiv.org/abs/2102.00075v1
- Date: Fri, 29 Jan 2021 21:25:34 GMT
- Title: RecSSD: Near Data Processing for Solid State Drive Based Recommendation
Inference
- Authors: Mark Wilkening, Udit Gupta, Samuel Hsia, Caroline Trippel, Carole-Jean
Wu, David Brooks, Gu-Yeon Wei
- Abstract summary: RecSSD is a near data processing based SSD memory system customized for neural recommendation.
It reduces end-to-end model inference latency by 2X compared to using COTS SSDs across eight industry-representative models.
- Score: 7.3762607002135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural personalized recommendation models are used across a wide variety of
datacenter applications including search, social media, and entertainment.
State-of-the-art models comprise large embedding tables that have billions of
parameters requiring large memory capacities. Unfortunately, large and fast
DRAM-based memories levy high infrastructure costs. Conventional SSD-based
storage solutions offer an order of magnitude larger capacity, but have worse
read latency and bandwidth, degrading inference performance. RecSSD is a near
data processing based SSD memory system customized for neural recommendation
inference that reduces end-to-end model inference latency by 2X compared to
using COTS SSDs across eight industry-representative models.
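For context, the memory-bound operation at the heart of these models is the pooled embedding-table lookup. The sketch below (illustrative only, not the RecSSD firmware or API; table sizes and IDs are made-up) shows the gather-and-sum that a near-data design pushes into the SSD so that only the pooled vector crosses the storage interface:
```python
import numpy as np

# Illustrative sketch of the pooled embedding lookup (gather + sum-pool) that
# dominates memory traffic in neural recommendation inference.
# Table size, embedding dimension, and feature IDs are assumptions for demo only.
NUM_ROWS, EMB_DIM = 100_000, 64
table = np.random.rand(NUM_ROWS, EMB_DIM).astype(np.float32)

def pooled_lookup(table: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Gather sparse feature IDs from one embedding table and sum-pool them.

    On a DRAM-based system this gather/reduce runs on the host; a
    near-data-processing design performs it inside the SSD so only the
    pooled EMB_DIM-float vector is returned over the storage interface.
    """
    return table[indices].sum(axis=0)

# One inference request touches only a handful of rows out of many.
sparse_ids = np.array([12, 4_096, 77_215, 31])
pooled = pooled_lookup(table, sparse_ids)
print(pooled.shape)  # (64,)
```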
Related papers
- An Efficient and Streaming Audio Visual Active Speaker Detection System [2.4515389321702132]
We present two scenarios that address the key challenges posed by real-time constraints.
First, we introduce a method to limit the number of future context frames utilized by the ASD model.
Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference.
arXiv Detail & Related papers (2024-09-13T17:45:53Z) - InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [10.115950753431528]
Large Language Models (LLMs) are a significant milestone in generative AI.
The increasing context length and batch size in offline LLM inference escalates the memory requirement of the key-value (KV) cache.
Several cost-effective solutions leverage host memory or SSDs to reduce storage costs for offline inference scenarios.
We propose InstInfer, which offloads the most performance-critical computation (i.e., attention in the decoding phase) and data (i.e., the KV cache) to Computational Storage Drives (CSDs).
InstInfer improves throughput for long-sequence inference.
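To make the offloaded computation concrete, here is a minimal single-step decode attention over a cached key/value store (a generic sketch under assumed shapes, not InstInfer's CSD implementation):
```python
import numpy as np

def decode_step_attention(q: np.ndarray, k_cache: np.ndarray, v_cache: np.ndarray) -> np.ndarray:
    """Single-token decode attention over a cached context.

    q:        (d,)    query for the newly generated token
    k_cache:  (t, d)  keys of the t previously processed tokens
    v_cache:  (t, d)  values of the t previously processed tokens

    The cost is dominated by streaming the (t, d) KV cache, which is why a
    design like InstInfer keeps the cache on computational storage and runs
    this phase there instead of shuttling the cache to the host.
    """
    d = q.shape[-1]
    scores = k_cache @ q / np.sqrt(d)          # (t,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the context
    return weights @ v_cache                   # (d,) attention output

# Toy example: 4096 cached tokens, head dimension 128 (assumed sizes).
t, d = 4096, 128
out = decode_step_attention(np.random.rand(d).astype(np.float32),
                            np.random.rand(t, d).astype(np.float32),
                            np.random.rand(t, d).astype(np.float32))
print(out.shape)  # (128,)
```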
arXiv Detail & Related papers (2024-09-08T06:06:44Z) - Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches [64.42735183056062]
Large language models (LLMs) have transitioned from specialized models to versatile foundation models.
LLMs exhibit impressive zero-shot ability; however, they require fine-tuning on local datasets and significant resources for deployment.
arXiv Detail & Related papers (2024-08-20T09:42:17Z) - Digital Twin-Assisted Data-Driven Optimization for Reliable Edge Caching in Wireless Networks [60.54852710216738]
We introduce a novel digital twin-assisted optimization framework, called D-REC, to ensure reliable caching in nextG wireless networks.
By incorporating reliability modules into a constrained decision process, D-REC can adaptively adjust actions, rewards, and states to comply with advantageous constraints.
arXiv Detail & Related papers (2024-06-29T02:40:28Z) - UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture [6.5386984667643695]
UpDLRM uses real-world processing-in-memory hardware, the UPMEM DPU, to boost memory bandwidth and reduce recommendation latency.
UpDLRM achieves much lower inference time for DLRM compared to both CPU-only and CPU-GPU hybrid counterparts.
arXiv Detail & Related papers (2024-06-20T02:20:21Z) - AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
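A back-of-the-envelope arithmetic-intensity calculation (my own illustration, with assumed layer sizes and hardware numbers) shows why single-batch decoding is bandwidth-bound rather than compute-bound:
```python
# Rough arithmetic-intensity estimate for one matrix-vector multiply in
# batch-1 decoding: every weight is read once and used for only two FLOPs
# (one multiply, one add). Sizes and hardware figures are illustrative assumptions.
d_in, d_out = 8192, 8192
bytes_per_weight = 2                      # FP16 weights
flops = 2 * d_in * d_out                  # one multiply-accumulate per weight
bytes_moved = d_in * d_out * bytes_per_weight
intensity = flops / bytes_moved           # = 1 FLOP per byte
print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")

# An accelerator with ~300 TFLOP/s FP16 compute and ~2 TB/s memory bandwidth
# has a machine balance of ~150 FLOP/byte, so a 1 FLOP/byte kernel uses well
# under 1% of peak compute and is limited entirely by memory bandwidth.
peak_flops, peak_bw = 300e12, 2e12
print(f"machine balance: {peak_flops / peak_bw:.0f} FLOP/byte")
```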
arXiv Detail & Related papers (2024-03-21T04:31:59Z) - HEAM : Hashed Embedding Acceleration using Processing-In-Memory [17.66751227197112]
In today's data centers, personalized recommendation systems face challenges such as the need for large memory capacity and high bandwidth.
Previous approaches have relied on DIMM-based near-memory processing techniques or introduced 3D-stacked DRAM to address memory-bound issues.
This paper introduces HEAM, a heterogeneous memory architecture that integrates 3D-stacked DRAM with DIMM to accelerate recommendation systems.
arXiv Detail & Related papers (2024-02-06T14:26:22Z) - Neural Network Compression for Noisy Storage Devices [71.4102472611862]
Conventionally, model compression and physical storage are decoupled.
This approach forces the storage to treat each bit of the compressed model equally, and to dedicate the same amount of resources to each bit.
We propose a radically different approach that: (i) employs analog memories to maximize the capacity of each memory cell, and (ii) jointly optimizes model compression and physical storage to maximize memory utility.
arXiv Detail & Related papers (2021-02-15T18:19:07Z) - SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and
Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to reliance on external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z) - Understanding Capacity-Driven Scale-Out Neural Recommendation Inference [1.9529164002361878]
This work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure.
We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution.
Even more encouragingly, we show how distributed inference can yield efficiency improvements in data-center-scale recommendation serving.
arXiv Detail & Related papers (2020-11-04T00:51:40Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)