Related papers: UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture

Related papers

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference [6.886434948681708]
Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. We identify that the critical memory bottleneck lies in the decoding phase of multi-head attention. We propose L3, a hardware-software co-designed system integrating DIMM-PIM and GPU devices.
arXiv Detail & Related papers (2025-04-24T14:14:07Z)
MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models [72.61076288351201]
We propose Memory-efficient Offloaded Mini-sequence Inference (MOM) MOM partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading. On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU.
arXiv Detail & Related papers (2025-04-16T23:15:09Z)
SCRec: A Scalable Computational Storage System with Statistical Sharding and Tensor-train Decomposition for Recommendation Models [17.602518628415776]
Deep Learning Recommendation Models (DLRMs) play a crucial role in delivering personalized content across web applications such as social networking and video streaming. With improvements in performance, the parameter size of DLRMs has grown to terabyte (TB) scales, accompanied by memory bandwidth demands exceeding TB/s levels. We propose SCRec, a scalable computational storage recommendation system that can handle TB-scale industrial DLRMs.
arXiv Detail & Related papers (2025-04-01T08:12:45Z)
A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training. Various memory-efficient Scals have been proposed to reduce memory usage. They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z)
Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios. For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations. For states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
arXiv Detail & Related papers (2024-12-02T06:57:46Z)
Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs [13.720423381263409]
We show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading to a 3.2x embedding-only performance slowdown. We propose specialized plug-and-play-based software prefetching and L2 pinning techniques, which help in hiding and decreasing the latencies. Our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
arXiv Detail & Related papers (2024-10-29T17:13:54Z)
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [10.115950753431528]
Large Language Models (LLMs) are a significant milestone in generative AI. The increasing context length and batch size in offline LLM inference escalates the memory requirement of the key-value (KV) cache. Several cost-effective solutions leverage host memory or optimized to reduce storage costs for offline inference scenarios. We propose InstInfer, which offloads the most performance-critical computation (i.e., attention in decoding phase) and data (i.e., KV cache) parts to Computational Storage Drives (CSDs) InstInfer improves throughput for long-sequence inference by
arXiv Detail & Related papers (2024-09-08T06:06:44Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator. We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
MTrainS: Improving DLRM training efficiency using heterogeneous memories [5.195887979684162]
In Deep Learning Recommendation Models (DLRM), sparse features capturing categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth. In this paper, we study the bandwidth requirement and locality of embedding tables in real-world deployed models. We then design MTrainS, which leverages heterogeneous memory, including byte and block addressable Storage Class Memory for DLRM hierarchically.
arXiv Detail & Related papers (2023-04-19T06:06:06Z)
A Frequency-aware Software Cache for Large Recommendation System Embeddings [11.873521953539361]
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies. We propose a GPU-based software cache approaches to dynamically manage the embedding table in the CPU and GPU memory space. Our proposed software cache is efficient in training entire DLRMs on GPU in a synchronized update manner.
arXiv Detail & Related papers (2022-08-08T12:08:05Z)
Supporting Massive DLRM Inference Through Software Defined Memory [18.52744448265802]
Deep Learning Recommendation Models (DLRM) are widespread, account for a considerable data center footprint, and grow by more than 1.5x per year. With model size soon to be in terabytes range, leveraging Storage ClassMemory (SCM) for inference enables lower power consumption and cost.
arXiv Detail & Related papers (2021-10-21T21:29:06Z)
SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to external dynamic random-access memory (DRAM) for storage. We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation. We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z)
Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models. This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.