Related papers: An Enhanced Batch Query Architecture in Real-time Recommendation

URL: http://arxiv.org/abs/2409.00400v1
Date: Sat, 31 Aug 2024 09:19:41 GMT
Title: An Enhanced Batch Query Architecture in Real-time Recommendation
Authors: Qiang Zhang, Zhipeng Teng, Disheng Wu, Jiayin Wang
Abstract summary: In industrial recommendation systems on websites and apps, it is essential to recall and predict top-n results relevant to user interests. We have designed and implemented a high-performance batch query architecture for real-time recommendation systems. This architecture has been deployed and in use in the bilibili recommendation system for over a year, supporting a 10x increase in model computation with minimal resource growth.
Score: 9.073405491915198
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In industrial recommendation systems on websites and apps, it is essential to recall and predict top-n results relevant to user interests from a content pool of billions within milliseconds. To cope with continuous data growth and improve real-time recommendation performance, we have designed and implemented a high-performance batch query architecture for real-time recommendation systems. Our contributions include optimizing hash structures with a cacheline-aware probing method to enhance coalesced hashing, as well as the implementation of a hybrid storage key-value service built upon it. Our experiments indicate this approach significantly surpasses conventional hash tables in batch query throughput, achieving up to 90% of the query throughput of random memory access when incorporating parallel optimization. The support for NVMe, integrating two-tier storage for hot and cold data, notably reduces resource consumption. Additionally, the system facilitates dynamic updates, automated sharding of attributes and feature embedding tables, and introduces innovative protocols for consistency in batch queries, thereby enhancing the effectiveness of real-time incremental learning updates. This architecture has been deployed for over a year in the recommendation system of bilibili, a video content community with hundreds of millions of users, supporting a 10x increase in model computation with minimal resource growth and improving outcomes while preserving the system's real-time performance.
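
The abstract names cacheline-aware probing for coalesced hashing but does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes 8 slots per 64-byte cache line, simulates lines as index ranges of a Python list, and uses hypothetical names (`CoalescedTable`, `SLOTS_PER_LINE`, `batch_get`). On a collision, the new key is placed in a free slot of the same simulated line as the chain tail where possible, so a chain walk tends to stay within one line.

```python
from typing import List, Optional, Sequence

SLOTS_PER_LINE = 8  # assumption: 64-byte cache line holding 8 slots
EMPTY = -1          # marks "no next slot" in a chain


class CoalescedTable:
    """Toy coalesced hash table: collision chains live inside the table itself."""

    def __init__(self, num_lines: int) -> None:
        self.capacity = num_lines * SLOTS_PER_LINE
        self.keys: List[Optional[int]] = [None] * self.capacity
        self.vals: List[Optional[str]] = [None] * self.capacity
        self.next: List[int] = [EMPTY] * self.capacity

    def _home(self, key: int) -> int:
        return hash(key) % self.capacity

    def _free_slot_near(self, slot: int) -> int:
        # First try the simulated cache line that contains `slot`.
        line_start = slot - slot % SLOTS_PER_LINE
        for i in range(line_start, line_start + SLOTS_PER_LINE):
            if self.keys[i] is None:
                return i
        # Otherwise scan the remaining slots, wrapping around the table.
        for off in range(SLOTS_PER_LINE, self.capacity):
            i = (line_start + off) % self.capacity
            if self.keys[i] is None:
                return i
        raise MemoryError("table is full")

    def put(self, key: int, val: str) -> None:
        slot = self._home(key)
        if self.keys[slot] is None:
            self.keys[slot], self.vals[slot] = key, val
            return
        # Walk to the chain tail, updating in place if the key already exists.
        while True:
            if self.keys[slot] == key:
                self.vals[slot] = val
                return
            if self.next[slot] == EMPTY:
                break
            slot = self.next[slot]
        free = self._free_slot_near(slot)
        self.keys[free], self.vals[free] = key, val
        self.next[slot] = free

    def get(self, key: int) -> Optional[str]:
        slot = self._home(key)
        while slot != EMPTY and self.keys[slot] is not None:
            if self.keys[slot] == key:
                return self.vals[slot]
            slot = self.next[slot]
        return None

    def batch_get(self, keys: Sequence[int]) -> List[Optional[str]]:
        # Visit keys in home-slot order so consecutive lookups touch nearby lines;
        # results are returned in the caller's original order.
        order = sorted(range(len(keys)), key=lambda i: self._home(keys[i]))
        out: List[Optional[str]] = [None] * len(keys)
        for i in order:
            out[i] = self.get(keys[i])
        return out
```

With a 16-slot table, inserting keys 0, 16, and 32 (which all hash to slot 0) places them in slots 0, 1, and 2 of the same simulated line. Python lists give no real control over memory layout, so this illustrates only the placement policy; a production version would use a contiguous, aligned array in a systems language.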

Related papers

Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding [0.0]
We present a framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. Our framework integrates two complementary techniques: Policy-Optimized Retrieval-Augmented Generation (PORAG) and Adaptive Token-Layer Attention Scoring (ATLAS). Our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems.
arXiv Detail & Related papers (2025-04-02T01:16:10Z)
ZeroLM: Data-Free Transformer Architecture Search for Language Models [54.83882149157548]
Current automated proxy discovery approaches suffer from extended search times, susceptibility to data overfitting, and structural complexity. This paper introduces a novel zero-cost proxy methodology that quantifies model capacity through efficient weight statistics. Our evaluation demonstrates the superiority of this approach, achieving a Spearman's rho of 0.76 and Kendall's tau of 0.53 on the FlexiBERT benchmark.
arXiv Detail & Related papers (2025-03-24T13:11:22Z)
Leveraging Approximate Caching for Faster Retrieval-Augmented Generation [1.3450852784287828]
Retrieval-augmented generation (RAG) enhances the reliability of large language model (LLM) answers by integrating external knowledge. RAG increases the end-to-end inference time since looking for relevant documents from large vector databases is computationally expensive. We introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries.
arXiv Detail & Related papers (2025-03-07T15:54:04Z)
A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
Dynamic Optimization of Storage Systems Using Reinforcement Learning Techniques [40.13303683102544]
This paper introduces RL-Storage, a reinforcement learning-based framework designed to dynamically optimize storage system configurations. RL-Storage learns from real-time I/O patterns and predicts optimal storage parameters, such as cache size, queue depths, and readahead settings. It achieves throughput gains of up to 2.6x and latency reductions of 43% compared to baselines.
arXiv Detail & Related papers (2024-12-29T17:41:40Z)
Novel Architecture for Distributed Travel Data Integration and Service Provision Using Microservices [1.03590082373586]
This paper introduces an architecture for enhancing the flexibility and performance of an airline reservation system. The design incorporates Redis cache technologies, two different messaging systems (Kafka and RabbitMQ), and two types of architectural storages (MongoDB and Docker). The architecture provides data consistency of 99.5% and a data propagation latency of less than 75 ms.
arXiv Detail & Related papers (2024-10-31T17:41:14Z)
Revisiting BPR: A Replicability Study of a Common Recommender System Baseline [78.00363373925758]
We study the features of the BPR model, indicating their impact on its performance, and investigate open-source BPR implementations. Our analysis reveals inconsistencies between these implementations and the original BPR paper, leading to a significant decrease in performance of up to 50% for specific implementations. We show that the BPR model can achieve performance levels close to state-of-the-art methods on the top-n recommendation tasks and even outperform them on specific datasets.
arXiv Detail & Related papers (2024-09-21T18:39:53Z)
EASRec: Elastic Architecture Search for Efficient Long-term Sequential Recommender Systems [82.76483989905961]
Current Sequential Recommender Systems (SRSs) suffer from computational and resource inefficiencies. We develop the Elastic Architecture Search for Efficient Long-term Sequential Recommender Systems (EASRec). EASRec introduces data-aware gates that leverage historical information from the input data batch to improve the performance of the recommendation network.
arXiv Detail & Related papers (2024-02-01T07:22:52Z)
Efficient Architecture Search via Bi-level Data Pruning [70.29970746807882]
This work pioneers an exploration into the critical role of dataset characteristics for DARTS bi-level optimization. We introduce a new progressive data pruning strategy that utilizes supernet prediction dynamics as the metric. Comprehensive evaluations on the NAS-Bench-201 search space, DARTS search space, and MobileNet-like search space validate that BDP reduces search costs by over 50%.
arXiv Detail & Related papers (2023-12-21T02:48:44Z)
Hybrid-RACA: Hybrid Retrieval-Augmented Composition Assistance for Real-time Text Prediction [17.94189417448127]
We propose Hybrid Retrieval-Augmented Composition Assistance (Hybrid-RACA) for real-time text prediction. It efficiently combines a cloud-based large language model with a smaller client-side model through retrieval augmented memory. Our experiments on five datasets demonstrate that Hybrid-RACA offers strong performance while maintaining low latency.
arXiv Detail & Related papers (2023-08-08T12:27:20Z)
HPC Storage Service Autotuning Using Variational-Autoencoder-Guided Asynchronous Bayesian Optimization [3.153934519625761]
We develop a novel variational-autoencoder-guided asynchronous Bayesian optimization method to tune HPC storage service parameters. We implement our approach within the DeepHyper open-source framework, and apply it to the autotuning of a high-energy physics workflow on Argonne's Theta supercomputer. Our approach is on par with state-of-the-art autotuning frameworks in speed and outperforms them in resource utilization and parallelization capabilities.
arXiv Detail & Related papers (2022-10-03T10:12:57Z)
BagPipe: Accelerating Deep Recommendation Model Training [9.911467752221863]
Bagpipe is a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with the computation. We design an Oracle Cacher, a new component that uses a lookahead algorithm to generate optimal cache update decisions.
arXiv Detail & Related papers (2022-02-24T23:54:12Z)
Data-Driven Offline Optimization For Architecting Hardware Accelerators [89.68870139177785]
We develop a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME. PRIME improves performance upon state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while considerably reducing the required total simulation time by 93% and 99%, respectively. In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x.
arXiv Detail & Related papers (2021-10-20T17:06:09Z)
DHA: End-to-End Joint Optimization of Data Augmentation Policy, Hyper-parameter and Architecture [81.82173855071312]
We propose an end-to-end solution that integrates the AutoML components and returns a ready-to-use model at the end of the search. DHA achieves state-of-the-art (SOTA) results on various datasets, notably 77.4% accuracy on ImageNet with a cell-based search space.
arXiv Detail & Related papers (2021-09-13T08:12:50Z)
Fast Class-wise Updating for Online Hashing [196.14748396106955]
This paper presents a novel supervised online hashing scheme, termed Fast Class-wise Updating for Online Hashing (FCOH). A class-wise updating method is developed to decompose the binary code learning and alternately renew the hash functions in a class-wise fashion, which reduces the burden of processing large numbers of training batches. To further achieve online efficiency, we propose a semi-relaxation optimization, which accelerates the online training by treating different binary constraints independently.
arXiv Detail & Related papers (2020-12-01T07:41:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.