ERCache: An Efficient and Reliable Caching Framework for Large-Scale User Representations in Meta's Ads System
- URL: http://arxiv.org/abs/2410.06497v1
- Date: Wed, 9 Oct 2024 02:51:27 GMT
- Title: ERCache: An Efficient and Reliable Caching Framework for Large-Scale User Representations in Meta's Ads System
- Authors: Fang Zhou, Yaning Huang, Dong Liang, Dai Li, Zhongke Zhang, Kai Wang, Xiao Xin, Abdallah Aboelela, Zheliang Jiang, Yang Wang, Jeff Song, Wei Zhang, Chen Liang, Huayu Li, ChongLin Sun, Hang Yang, Lei Qu, Zhan Shu, Mindi Yuan, Emanuele Maccherani, Taha Hayat, John Guo, Varna Puvvada, Uladzimir Pashkevich
- Abstract summary: ERCache is an efficient and robust caching framework for large-scale user representations in ads recommendation systems on social networks.
ERCache has been deployed at Meta for over six months, supporting more than 30 ranking models.
- Score: 20.058668768238984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing complexity of deep learning models used for calculating user representations presents significant challenges, particularly with limited computational resources and strict service-level agreements (SLAs). Previous research efforts have focused on optimizing model inference but have overlooked a critical question: is it necessary to perform user model inference for every ad request in large-scale social networks? To address this question and these challenges, we first analyze user access patterns at Meta and find that most user model inferences occur within a short timeframe. This observation reveals a triangular relationship among model complexity, embedding freshness, and service SLAs. Building on this insight, we designed, implemented, and evaluated ERCache, an efficient and robust caching framework for large-scale user representations in ads recommendation systems on social networks. ERCache categorizes cache into direct and failover types and applies customized settings and eviction policies for each model, effectively balancing model complexity, embedding freshness, and service SLAs, even considering the staleness introduced by caching. ERCache has been deployed at Meta for over six months, supporting more than 30 ranking models while efficiently conserving computational resources and complying with service SLA requirements.
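The paper does not include code, but the core mechanism it describes — a per-model cache with a direct tier (fresh embeddings served within a freshness bound) and a failover tier (stale embeddings served when recomputation would violate the SLA) — can be sketched as follows. The class name, TTL values, SLA threshold, and `infer_fn` callback are illustrative assumptions, not Meta's implementation.

```python
import time

class ERCacheSketch:
    """Minimal sketch of a direct/failover user-embedding cache.

    Direct hits serve embeddings younger than `direct_ttl`; older entries
    (up to `failover_ttl`) are served only when fresh inference cannot fit
    the request's SLA budget. All names, thresholds, and policies here are
    illustrative assumptions, not Meta's actual implementation.
    """

    def __init__(self, infer_fn, direct_ttl=60.0, failover_ttl=600.0):
        self.infer_fn = infer_fn          # model inference callback
        self.direct_ttl = direct_ttl      # per-model freshness bound (s)
        self.failover_ttl = failover_ttl  # staleness bound for failover (s)
        self.store = {}                   # user_id -> (embedding, timestamp)

    def get(self, user_id, sla_budget_ms):
        now = time.time()
        entry = self.store.get(user_id)
        if entry is not None:
            emb, ts = entry
            if now - ts <= self.direct_ttl:
                return emb                      # direct cache hit
        # Miss or stale entry: recompute only if the SLA budget allows it.
        if sla_budget_ms >= 50:                 # assumed inference cost
            emb = self.infer_fn(user_id)
            self.store[user_id] = (emb, now)
            return emb
        # Otherwise fall back to a stale-but-usable embedding.
        if entry is not None and now - entry[1] <= self.failover_ttl:
            return entry[0]                     # failover cache hit
        raise RuntimeError("no embedding available within SLA")
```

Tuning `direct_ttl` per model is what lets a framework like this trade embedding freshness against model complexity under a fixed SLA.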
Related papers
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we utilize frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
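A minimal sketch of the frequency-driven importance idea: score cached entries by how often they are reused, keep the most important ones, and merge pruned entries into their nearest kept neighbor rather than discarding them outright. The scoring and averaging rules below are simplified assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def elastic_prune(cache, freq, keep):
    """Importance-driven cache merging sketch.

    cache: (n, d) array of cached key/value vectors
    freq:  (n,) access-frequency counts used as importance scores
    keep:  number of entries to retain
    Pruned entries are merged into their most similar kept entry
    instead of being dropped, so their information is preserved.
    """
    order = np.argsort(freq)[::-1]          # most important first
    kept_idx, pruned_idx = order[:keep], order[keep:]
    kept = cache[kept_idx].copy()
    for i in pruned_idx:
        # Merge each pruned vector into its nearest kept neighbor.
        sims = kept @ cache[i]
        j = int(np.argmax(sims))
        kept[j] = (kept[j] + cache[i]) / 2.0
    return kept

cache = np.random.randn(8, 4)
freq = np.array([5, 1, 9, 2, 7, 0, 3, 4])
print(elastic_prune(cache, freq, keep=4).shape)  # (4, 4)
```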
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models [15.742472622602557]
We propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns.
Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLMChat services.
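Semantic caching replaces exact string matching with similarity matching over query embeddings, so paraphrased questions can reuse a stored answer. A minimal sketch, assuming a stand-in `embed` function and a cosine-similarity threshold (both illustrative, not SCALM's actual components):

```python
import numpy as np

def embed(text):
    # Stand-in embedding: hash words into a fixed-size vector.
    # A real semantic cache would use a sentence-embedding model.
    v = np.zeros(64)
    for w in text.lower().split():
        v[hash(w) % 64] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def get(self, query):
        q = embed(query)
        for e, answer in self.entries:
            if float(q @ e) >= self.threshold:  # cosine sim (unit vectors)
                return answer                   # semantic cache hit
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do I reset my password", "Use the account settings page.")
print(cache.get("how do I reset my password please"))  # hit via similarity
```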
arXiv Detail & Related papers (2024-05-24T08:16:22Z)
- TrimCaching: Parameter-sharing AI Model Caching in Wireless Edge Networks [36.39118138582416]
Next-generation mobile networks are expected to facilitate fast AI model downloading to end users.
By caching models on edge servers, mobile networks can deliver models to end users with low latency.
We develop a novel model placement scheme, called parameter-sharing model caching (TrimCaching)
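The key observation behind parameter-sharing caching is that many models reuse large parameter blocks (e.g., a common backbone), so caching a set of models costs the size of the union of their blocks, not the sum. A greedy placement sketch under that assumption; the block structure and greedy rule are illustrative, not the paper's algorithm:

```python
def greedy_place(model_blocks, demand, capacity):
    """Greedily cache models ranked by demand per unit of *extra* storage.

    Because models share parameter blocks, adding a model only costs the
    blocks not already cached -- the union, not the sum, of block sets.
    """
    cached, used_blocks = [], set()
    candidates = set(model_blocks)
    while candidates:
        best = max(candidates, key=lambda m:
                   demand[m] / max(len(model_blocks[m] - used_blocks), 1))
        new_blocks = used_blocks | model_blocks[best]
        candidates.remove(best)
        if len(new_blocks) > capacity:
            continue              # does not fit; try the next candidate
        cached.append(best)
        used_blocks = new_blocks
    return cached

# Models A and B share a backbone (blocks 0-7); C is standalone.
model_blocks = {"A": set(range(8)) | {8}, "B": set(range(8)) | {9},
                "C": {10, 11, 12}}
# B fits at the end because the backbone is already cached for A.
print(greedy_place(model_blocks, {"A": 10, "B": 8, "C": 9}, capacity=13))
```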
arXiv Detail & Related papers (2024-05-07T04:08:49Z)
- A Learning-Based Caching Mechanism for Edge Content Delivery [2.412158290827225]
5G networks and the rise of the Internet of Things (IoT) are increasingly pushing content delivery to the network edge.
This shift introduces unique challenges, particularly due to the limited cache storage and the diverse request patterns at the edge.
We introduce HR-Cache, a learning-based caching framework grounded in the principles of Hazard Rate (HR) ordering.
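Hazard Rate ordering ranks objects by the instantaneous probability of being re-requested given the time since their last request, and evicts the object with the lowest hazard. A minimal sketch under an exponential inter-request assumption, where the hazard rate reduces to 1 / mean inter-arrival time; the estimator and eviction rule here are simplified stand-ins for HR-Cache's learned approach:

```python
import time
from collections import defaultdict

class HRCacheSketch:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}                            # key -> value
        self.last_seen = {}                       # key -> last request time
        self.mean_gap = defaultdict(lambda: 1.0)  # EWMA of inter-arrival gap

    def _hazard(self, key):
        # Exponential assumption: hazard rate = 1 / mean inter-arrival time,
        # independent of how long ago the last request was.
        return 1.0 / self.mean_gap[key]

    def access(self, key, value, now=None):
        now = time.time() if now is None else now
        if key in self.last_seen:
            gap = now - self.last_seen[key]
            self.mean_gap[key] = 0.8 * self.mean_gap[key] + 0.2 * gap
        self.last_seen[key] = now
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the cached object with the lowest hazard rate.
            victim = min(self.data, key=self._hazard)
            del self.data[victim]
        self.data[key] = value
```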
arXiv Detail & Related papers (2024-02-05T08:06:03Z)
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized visual prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
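In spirit, the approach pairs a learnable input-space prompt with weight pruning, so sparsity is chosen jointly with a data-side modification. A toy sketch of the pruning half (global magnitude pruning; the additive input tensor is only a loose stand-in for the paper's visual prompts):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))        # a batch of inputs
prompt = 0.1 * rng.standard_normal(16)  # visual-prompt stand-in (learnable)
W = rng.standard_normal((16, 8))        # dense layer weights

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(W.size * sparsity)
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    return W * (np.abs(W) >= thresh)

W_sparse = magnitude_prune(W, sparsity=0.75)
y = (x + prompt) @ W_sparse             # prompt-conditioned sparse forward
print(f"nonzero weights: {np.count_nonzero(W_sparse)}/{W.size}")
```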
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- SQLNet: Scale-Modulated Query and Localization Network for Few-Shot Class-Agnostic Counting [71.38754976584009]
The class-agnostic counting (CAC) task has recently been proposed to solve the problem of counting all objects of an arbitrary class with several exemplars given in the input image.
We propose a novel localization-based CAC approach, termed Scale-modulated Query and Localization Network (SQLNet).
It fully explores the scales of exemplars in both the query and localization stages and achieves effective counting by accurately locating each object and predicting its approximate size.
arXiv Detail & Related papers (2023-11-16T16:50:56Z)
- Improving the Performance of DNN-based Software Services using Automated Layer Caching [3.804240190982695]
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services.
The computational complexity of such large models can still be significant, hindering low-latency inference.
In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services.
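The idea is to memoize intermediate activations of early layers, so repeated or near-identical inputs skip part of the forward pass. A minimal sketch keyed on an exact input hash; real systems would use approximate keys and automated layer selection, so the layer split and hashing here are illustrative assumptions:

```python
import hashlib
import numpy as np

def input_key(x):
    return hashlib.sha1(x.tobytes()).hexdigest()

class LayerCachedModel:
    def __init__(self, early_layers, late_layers):
        self.early = early_layers   # expensive shared prefix of the network
        self.late = late_layers     # remaining layers
        self.cache = {}             # input hash -> prefix activation

    def forward(self, x):
        key = input_key(x)
        h = self.cache.get(key)
        if h is None:               # miss: run the prefix and memoize it
            h = x
            for layer in self.early:
                h = layer(h)
            self.cache[key] = h
        for layer in self.late:     # always run the suffix
            h = layer(h)
        return h

relu = lambda z: np.maximum(z, 0)
model = LayerCachedModel([relu], [lambda z: z.sum()])
x = np.ones((2, 3))
print(model.forward(x), model.forward(x))  # second call hits the cache
```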
arXiv Detail & Related papers (2022-09-18T18:21:20Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they also introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
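Approximate-key caching quantizes the input before lookup, so near-identical inputs map to the same key and reuse a previous classification at the cost of a bounded approximation error. A minimal LRU-based sketch; the quantization step and error control are simplified assumptions:

```python
import numpy as np
from collections import OrderedDict

class ApproxKeyCache:
    def __init__(self, capacity, grid=0.5):
        self.capacity = capacity
        self.grid = grid            # coarser grid => more approximate hits
        self.store = OrderedDict()  # LRU order: oldest first

    def _key(self, x):
        # Quantize features so nearby inputs collide on the same key.
        return tuple(np.round(np.asarray(x) / self.grid).astype(int))

    def get(self, x):
        key = self._key(x)
        if key in self.store:
            self.store.move_to_end(key)     # refresh LRU position
            return self.store[key]          # approximate hit
        return None

    def put(self, x, label):
        key = self._key(x)
        self.store[key] = label
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict LRU entry

cache = ApproxKeyCache(capacity=100)
cache.put([0.98, 2.03], "cat")
print(cache.get([1.02, 1.97]))  # -> "cat", served without rerunning the DNN
```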
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
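QAMem's core change can be sketched in a few lines: in the query-to-feature-map attention, each query indexes its own value table instead of a value set shared by all queries. The shapes and the plain dot-product attention below are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_queries, n_cells, d = 5, 16, 8       # queries, low-res map cells, dim
Q = rng.standard_normal((n_queries, d))
K = rng.standard_normal((n_cells, d))

attn = softmax(Q @ K.T / np.sqrt(d))   # (n_queries, n_cells)

# Shared memory: one value vector per spatial cell, reused by every query.
V_shared = rng.standard_normal((n_cells, d))
out_shared = attn @ V_shared

# QAMem-style: a separate value table per query, to improve query
# discriminability on coarse feature maps.
V_per_query = rng.standard_normal((n_queries, n_cells, d))
out_qamem = np.einsum("qc,qcd->qd", attn, V_per_query)
print(out_shared.shape, out_qamem.shape)  # (5, 8) (5, 8)
```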
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Reinforcement Learning for Caching with Space-Time Popularity Dynamics [61.55827760294755]
Caching is envisioned to play a critical role in next-generation networks.
To intelligently prefetch and store contents, a cache node should be able to learn what and when to cache.
This chapter presents a versatile reinforcement learning based approach for near-optimal caching policy design.
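A compact illustration of the RL view of caching: the state is the cache content plus the requested item, actions either bypass or admit the item by evicting a chosen slot, and the reward is a cache hit. The tabular Q-learning below is a toy stand-in for the chapter's methods, with an assumed popularity-skewed request stream:

```python
import random
from collections import defaultdict

CAPACITY, ALPHA, GAMMA, EPS = 2, 0.1, 0.9, 0.1
REQS = [0, 0, 0, 1, 2, 3]   # skewed popularity: item 0 is most requested
ACTIONS = range(CAPACITY + 1)
Q = defaultdict(float)      # (state, action) -> value

def step(cache, req, action):
    """Apply `action` (0 = bypass, 1..C = evict that slot and admit req)."""
    reward = 1.0 if req in cache else 0.0
    cache = list(cache)
    if action > 0 and req not in cache:
        if len(cache) < CAPACITY:
            cache.append(req)
        else:
            cache[action - 1] = req    # evict the chosen slot
    return tuple(sorted(cache)), reward

cache, req = (), random.choice(REQS)
for _ in range(20000):
    state = (cache, req)
    if random.random() < EPS:          # epsilon-greedy exploration
        a = random.choice(list(ACTIONS))
    else:
        a = max(ACTIONS, key=lambda x: Q[(state, x)])
    next_cache, r = step(cache, req, a)
    next_req = random.choice(REQS)
    best_next = max(Q[((next_cache, next_req), x)] for x in ACTIONS)
    Q[(state, a)] += ALPHA * (r + GAMMA * best_next - Q[(state, a)])
    cache, req = next_cache, next_req
```

With enough exploration, the learned policy tends to keep the popular item resident; richer state encodings would capture the space-time popularity dynamics the chapter studies.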
arXiv Detail & Related papers (2020-05-19T01:23:51Z)
- Artificial Intelligence Assisted Collaborative Edge Caching in Small Cell Networks [19.605382256630538]
This paper considers heterogeneous content preferences of users together with heterogeneous caching models at the edge nodes.
We propose a modified particle swarm optimization (M-PSO) algorithm that efficiently solves the complex constraint problem in a reasonable time.
arXiv Detail & Related papers (2020-05-16T10:39:46Z)
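As a flavor of the optimization side, here is a generic particle swarm optimization loop applied to a toy cache-placement objective: which items each of two edge nodes should cache to maximize weighted hit rate. This is plain PSO on a relaxed 0-1 placement, not the paper's modified M-PSO or its constraint handling:

```python
import numpy as np

rng = np.random.default_rng(1)
N_ITEMS, N_NODES, CAP = 6, 2, 2
popularity = np.array([[.4, .2, .2, .1, .05, .05],
                       [.1, .1, .3, .3, .1, .1]])  # per-node preferences

def fitness(x):
    """x: relaxed placement in [0,1]^(nodes*items); score weighted hits."""
    p = x.reshape(N_NODES, N_ITEMS)
    score = 0.0
    for n in range(N_NODES):
        # Keep each node's top-CAP items, enforcing the capacity constraint.
        top = np.argsort(p[n])[-CAP:]
        score += popularity[n, top].sum()
    return score

n_particles, dim = 20, N_NODES * N_ITEMS
x = rng.random((n_particles, dim))
v = np.zeros_like(x)
pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
gbest = pbest[pbest_f.argmax()].copy()

for _ in range(100):
    r1, r2 = rng.random((2, n_particles, dim))
    # Standard PSO velocity update: inertia + cognitive + social terms.
    v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
    x = np.clip(x + v, 0.0, 1.0)
    f = np.array([fitness(p) for p in x])
    improved = f > pbest_f
    pbest[improved], pbest_f[improved] = x[improved], f[improved]
    gbest = pbest[pbest_f.argmax()].copy()

print("best placement per node:",
      [np.argsort(gbest.reshape(N_NODES, N_ITEMS)[n])[-CAP:].tolist()
       for n in range(N_NODES)])
```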