RecSSD: Near Data Processing for Solid State Drive Based Recommendation
Inference
- URL: http://arxiv.org/abs/2102.00075v1
- Date: Fri, 29 Jan 2021 21:25:34 GMT
- Title: RecSSD: Near Data Processing for Solid State Drive Based Recommendation
Inference
- Authors: Mark Wilkening, Udit Gupta, Samuel Hsia, Caroline Trippel, Carole-Jean
Wu, David Brooks, Gu-Yeon Wei
- Abstract summary: RecSSD is a near data processing based SSD memory system customized for neural recommendation.
It reduces end-to-end model inference latency by 2X compared to using COTS SSDs across eight industry-representative models.
- Score: 7.3762607002135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural personalized recommendation models are used across a wide variety of
datacenter applications including search, social media, and entertainment.
State-of-the-art models comprise large embedding tables that have billions of
parameters requiring large memory capacities. Unfortunately, large and fast
DRAM-based memories levy high infrastructure costs. Conventional SSD-based
storage solutions offer an order of magnitude larger capacity, but have worse
read latency and bandwidth, degrading inference performance. RecSSD is a near
data processing based SSD memory system customized for neural recommendation
inference that reduces end-to-end model inference latency by 2X compared to
using COTS SSDs across eight industry-representative models.
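For context, the memory-bound operation at the heart of these models is the pooled embedding-table lookup. The sketch below (illustrative only, not the RecSSD firmware or API; table sizes and IDs are made-up) shows the gather-and-sum that a near-data design pushes into the SSD so that only the pooled vector crosses the storage interface:
```python
import numpy as np

# Illustrative sketch of the pooled embedding lookup (gather + sum-pool) that
# dominates memory traffic in neural recommendation inference.
# Table size, embedding dimension, and feature IDs are assumptions for demo only.
NUM_ROWS, EMB_DIM = 100_000, 64
table = np.random.rand(NUM_ROWS, EMB_DIM).astype(np.float32)

def pooled_lookup(table: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Gather sparse feature IDs from one embedding table and sum-pool them.

    On a DRAM-based system this gather/reduce runs on the host; a
    near-data-processing design performs it inside the SSD so only the
    pooled EMB_DIM-float vector is returned over the storage interface.
    """
    return table[indices].sum(axis=0)

# One inference request touches only a handful of rows out of many.
sparse_ids = np.array([12, 4_096, 77_215, 31])
pooled = pooled_lookup(table, sparse_ids)
print(pooled.shape)  # (64,)
```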
Related papers
- An Efficient and Streaming Audio Visual Active Speaker Detection System [2.4515389321702132]
We present two scenarios that address the key challenges posed by real-time constraints.
First, we introduce a method to limit the number of future context frames utilized by the ASD model.
Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference.
arXiv Detail & Related papers (2024-09-13T17:45:53Z) - InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [10.115950753431528]
Large Language Models (LLMs) are a significant milestone in generative AI.
The increasing context length and batch size in offline LLM inference escalates the memory requirement of the key-value (KV) cache.
Several cost-effective solutions leverage host memory or SSDs to reduce storage costs for offline inference scenarios.
We propose InstInfer, which offloads the most performance-critical computation (i.e., attention in the decoding phase) and data (i.e., the KV cache) to Computational Storage Drives (CSDs).
InstInfer improves throughput for long-sequence inference.
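To make the offloaded computation concrete, here is a minimal single-step decode attention over a cached key/value store (a generic sketch under assumed shapes, not InstInfer's CSD implementation):
```python
import numpy as np

def decode_step_attention(q: np.ndarray, k_cache: np.ndarray, v_cache: np.ndarray) -> np.ndarray:
    """Single-token decode attention over a cached context.

    q:        (d,)    query for the newly generated token
    k_cache:  (t, d)  keys of the t previously processed tokens
    v_cache:  (t, d)  values of the t previously processed tokens

    The cost is dominated by streaming the (t, d) KV cache, which is why a
    design like InstInfer keeps the cache on computational storage and runs
    this phase there instead of shuttling the cache to the host.
    """
    d = q.shape[-1]
    scores = k_cache @ q / np.sqrt(d)          # (t,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the context
    return weights @ v_cache                   # (d,) attention output

# Toy example: 4096 cached tokens, head dimension 128 (assumed sizes).
t, d = 4096, 128
out = decode_step_attention(np.random.rand(d).astype(np.float32),
                            np.random.rand(t, d).astype(np.float32),
                            np.random.rand(t, d).astype(np.float32))
print(out.shape)  # (128,)
```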
arXiv Detail & Related papers (2024-09-08T06:06:44Z) - Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches [64.42735183056062]
Large language models (LLMs) have transitioned from specialized models to versatile foundation models.
LLMs exhibit impressive zero-shot ability; however, they require fine-tuning on local datasets and significant resources for deployment.
arXiv Detail & Related papers (2024-08-20T09:42:17Z) - Digital Twin-Assisted Data-Driven Optimization for Reliable Edge Caching in Wireless Networks [60.54852710216738]
We introduce a novel digital twin-assisted optimization framework, called D-REC, to ensure reliable caching in nextG wireless networks.
By incorporating reliability modules into a constrained decision process, D-REC can adaptively adjust actions, rewards, and states to comply with advantageous constraints.
arXiv Detail & Related papers (2024-06-29T02:40:28Z) - UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture [6.5386984667643695]
UpDLRM uses real-world processing-in-memory hardware, the UPMEM DPU, to boost memory bandwidth and reduce recommendation latency.
UpDLRM achieves much lower inference time for DLRM compared to both CPU-only and CPU-GPU hybrid counterparts.
arXiv Detail & Related papers (2024-06-20T02:20:21Z) - AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
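A back-of-the-envelope arithmetic-intensity calculation (my own illustration, with assumed layer sizes and hardware numbers) shows why single-batch decoding is bandwidth-bound rather than compute-bound:
```python
# Rough arithmetic-intensity estimate for one matrix-vector multiply in
# batch-1 decoding: every weight is read once and used for only two FLOPs
# (one multiply, one add). Sizes and hardware figures are illustrative assumptions.
d_in, d_out = 8192, 8192
bytes_per_weight = 2                      # FP16 weights
flops = 2 * d_in * d_out                  # one multiply-accumulate per weight
bytes_moved = d_in * d_out * bytes_per_weight
intensity = flops / bytes_moved           # = 1 FLOP per byte
print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")

# An accelerator with ~300 TFLOP/s FP16 compute and ~2 TB/s memory bandwidth
# has a machine balance of ~150 FLOP/byte, so a 1 FLOP/byte kernel uses well
# under 1% of peak compute and is limited entirely by memory bandwidth.
peak_flops, peak_bw = 300e12, 2e12
print(f"machine balance: {peak_flops / peak_bw:.0f} FLOP/byte")
```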
arXiv Detail & Related papers (2024-03-21T04:31:59Z) - HEAM : Hashed Embedding Acceleration using Processing-In-Memory [17.66751227197112]
In today's data centers, personalized recommendation systems face challenges such as the need for large memory capacity and high bandwidth.
Previous approaches have relied on DIMM-based near-memory processing techniques or introduced 3D-stacked DRAM to address memory-bound issues.
This paper introduces HEAM, a heterogeneous memory architecture that integrates 3D-stacked DRAM with DIMM to accelerate recommendation systems.
arXiv Detail & Related papers (2024-02-06T14:26:22Z) - Neural Network Compression for Noisy Storage Devices [71.4102472611862]
Conventionally, model compression and physical storage are decoupled.
This approach forces the storage to treat each bit of the compressed model equally, and to dedicate the same amount of resources to each bit.
We propose a radically different approach that: (i) employs analog memories to maximize the capacity of each memory cell, and (ii) jointly optimizes model compression and physical storage to maximize memory utility.
arXiv Detail & Related papers (2021-02-15T18:19:07Z) - SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and
Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to reliance on external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z) - Understanding Capacity-Driven Scale-Out Neural Recommendation Inference [1.9529164002361878]
This work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure.
We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution.
Even more encouragingly, we show how distributed inference can yield efficiency improvements in data-center-scale recommendation serving.
arXiv Detail & Related papers (2020-11-04T00:51:40Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)