xGR: Efficient Generative Recommendation Serving at Scale
- URL: http://arxiv.org/abs/2512.11529v2
- Date: Fri, 19 Dec 2025 11:20:16 GMT
- Title: xGR: Efficient Generative Recommendation Serving at Scale
- Authors: Qingxiao Sun, Tongxuan Liu, Shen Zhang, Siyu Wu, Peijun Yang, Haotian Liang, Menxin Li, Xiaolong Ma, Zhiwei Liang, Ziyi Ren, Minchao Zhang, Xinyu Liu, Ke Zhang, Depei Qian, Hailong Yang
- Abstract summary: We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. xGR unifies the processing of the prefill and decode phases through staged computation and a separated KV cache. Experiments with real-world recommendation service datasets demonstrate that xGR achieves at least 3.49x the throughput of the state-of-the-art baseline.
- Score: 19.770951650969973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, since the beam search involves a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of the prefill and decode phases through staged computation and a separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multi-level overlap and multi-stream parallelism. Our experiments with real-world recommendation service datasets demonstrate that xGR achieves at least 3.49x the throughput of the state-of-the-art baseline under strict latency constraints.
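The abstract's second contribution (early sorting termination plus mask-based item filtering) can be illustrated with a minimal NumPy sketch: instead of fully sorting the item space at each beam-search step, a partial selection orders only the top-k slice, and invalid items are excluded by masking the score buffer in place so it can be reused across decode steps. The function name and shapes below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_topk_masked(scores, valid_mask, k):
    """Pick the k best candidate items without fully sorting the item space.

    scores:     1-D array of beam-expansion scores over the item vocabulary
    valid_mask: boolean array; False marks items filtered out
                (e.g. already recommended or outside the catalog)
    k:          beam width
    """
    # Mask invalid items in place rather than physically removing them,
    # so the score buffer layout is reusable across decode steps.
    masked = np.where(valid_mask, scores, -np.inf)
    # np.argpartition performs an O(n) partial selection -- the "early
    # sorting termination": only the k selected entries are sorted after.
    top = np.argpartition(masked, -k)[-k:]
    return top[np.argsort(masked[top])[::-1]]  # best-first order

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
mask = np.array([True, False, True, True, True])  # item 1 filtered out
print(select_topk_masked(scores, mask, 2))  # → [3 4]
```

With a vocabulary of millions of items and a large beam width, replacing a full O(n log n) sort with an O(n) partial selection at every decode step is where the claimed sorting savings would come from.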
Related papers
- GEMs: Breaking the Long-Sequence Barrier in Generative Recommendation with a Multi-Stream Decoder [54.64137490632567]
We propose a novel and unified framework designed to capture users' sequences from long-term history. Generative Multi-streamers (GEMs) break user sequences into three streams. Extensive experiments on large-scale industrial datasets demonstrate that GEMs significantly outperforms state-of-the-art methods in recommendation accuracy.
arXiv Detail & Related papers (2026-02-14T06:42:56Z)
- Bringing Reasoning to Generative Recommendation Through the Lens of Cascaded Ranking [107.09842504618369]
Generative Recommendation (GR) has become a promising end-to-end approach with high FLOPS utilization for resource-efficient recommendation. We show that current GR models suffer from a critical bias amplification issue, where token-level bias escalates as token generation progresses. To combat the bias amplification issue, it is crucial for GR to 1) incorporate more heterogeneous information, and 2) allocate greater computational resources at each token generation step.
arXiv Detail & Related papers (2026-02-03T16:10:54Z)
- RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference [46.66085102313264]
Real-time recommender systems execute cascades (retrieval, pre-processing, fine-grained ranking) under strict tail-latency SLOs. We present RelayGR, a production system that enables in-HBM relay-race inference for GR. RelayGR supports up to 1.5x longer sequences and improves SLO-compliant throughput by up to 3.6x.
arXiv Detail & Related papers (2026-01-05T01:34:06Z)
- ZeroGR: A Generalizable and Scalable Framework for Zero-Shot Generative Retrieval [125.19156877994612]
Generative retrieval (GR) reformulates information retrieval (IR) by framing it as the generation of document identifiers (docids). We propose ZeroGR, a zero-shot generative retrieval framework that leverages natural language instructions to extend GR across a wide range of IR tasks. Specifically, ZeroGR is composed of three key components: (i) an LM-based docid generator that unifies heterogeneous documents into semantically meaningful docids; (ii) an instruction-tuned query generator that generates diverse types of queries from natural language task descriptions to enhance
arXiv Detail & Related papers (2025-10-12T03:04:24Z)
- REFRAG: Rethinking RAG based Decoding [67.4862300145604]
REFRAG is an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization.
arXiv Detail & Related papers (2025-09-01T03:31:44Z)
- Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation [79.75818239774952]
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. We propose Align-GRAG, a novel reasoning-guided dual alignment framework for the post-retrieval phase.
arXiv Detail & Related papers (2025-05-22T05:15:27Z)
- Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning [62.640169289390535]
SPLIT-RAG is a multi-agent RAG framework that addresses these limitations with question-driven semantic graph partitioning and collaborative subgraph retrieval. The framework first creates a Semantic Partitioning of Linked Information, then uses a type-specialized knowledge base to achieve multi-agent RAG. The attribute-aware graph segmentation divides knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types. A hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications.
arXiv Detail & Related papers (2025-05-20T06:44:34Z)
- MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG [45.319085406042966]
Multi-scale Adaptive Context RAG (MacRAG) is a hierarchical RAG framework that compresses and partitions documents into coarse-to-fine granularities. MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning.
arXiv Detail & Related papers (2025-05-10T08:50:44Z)
- VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG [2.0929459605817193]
Retrieval-Augmented Generation (RAG) systems combine vector similarity search with large language models (LLMs) to deliver context-aware responses. We present VectorLiteRAG, a deployment-friendly RAG system that achieves latency-compliant inference without requiring additional hardware resources.
arXiv Detail & Related papers (2025-04-11T19:18:41Z)
- RTLRepoCoder: Repository-Level RTL Code Completion through the Combination of Fine-Tuning and Retrieval Augmentation [6.428086269916113]
We propose RTLRepoCoder, a groundbreaking solution that incorporates specific fine-tuning and Retrieval-Augmented Generation (RAG) for repository-level Verilog code completion. Our solution achieves state-of-the-art performance on a public benchmark, significantly surpassing GPT-4 and advanced domain-specific LLMs on Edit Similarity and Exact Match rate.
arXiv Detail & Related papers (2025-04-11T09:04:50Z)
- MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation [60.04380907045708]
Retrieval-Augmented Generation (RAG) is considered a promising strategy to address this problem. We propose MemoRAG, a novel RAG framework empowered by global memory-augmented retrieval. MemoRAG achieves superior performances across a variety of long-context evaluation tasks.
arXiv Detail & Related papers (2024-09-09T13:20:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.