Efficient Inference for Large Language Model-based Generative Recommendation
- URL: http://arxiv.org/abs/2410.05165v2
- Date: Tue, 8 Oct 2024 13:33:52 GMT
- Title: Efficient Inference for Large Language Model-based Generative Recommendation
- Authors: Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua
- Abstract summary: Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly.
Applying Speculative Decoding (SD) to generative recommendation presents unique challenges due to the requirement of generating top-K items.
We propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification.
- Score: 78.38878421030522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly, particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding acceleration, Speculative Decoding (SD) has emerged as a promising solution. However, applying SD to generative recommendation presents unique challenges due to the requirement of generating top-K items (i.e., K distinct token sequences) as a recommendation list by beam search. This leads to more stringent verification in SD, where all the top-K sequences from the target LLM must be successfully drafted by the draft model at each decoding step. To alleviate this, we consider 1) boosting top-K sequence alignment between the draft model and the target LLM, and 2) relaxing the verification strategy to reduce trivial LLM calls. To this end, we propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification. Moreover, we introduce a relaxed sampling verification strategy that allows high-probability non-top-K drafted sequences to be accepted, significantly reducing LLM calls. Correspondingly, we propose AtSpeed-R for top-K alignment under this relaxed sampling verification. Empirical results on two real-world datasets demonstrate that AtSpeed significantly accelerates LLM-based generative recommendation, e.g., nearly 2x speedup under strict top-K verification and up to 2.5x speedup under relaxed sampling verification. The code and datasets will be released in the near future.
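To make the two verification strategies concrete, here is a minimal, self-contained sketch. It is an illustrative toy, not AtSpeed's implementation: beams are plain tuples of token ids, the draft/target dictionaries stand in for the top-K beam log-probabilities that a draft model and the target LLM would produce at one decoding step, and all function names are hypothetical. Strict top-K verification accepts the drafted step only if every target top-K beam was drafted; the relaxed sampling rule accepts each drafted beam with probability min(1, p_target/p_draft), so a high-probability beam outside the exact top-K can still survive.

```python
# Illustrative sketch of the two verification rules described in the abstract.
# All names (Beam, strict_topk_verify, relaxed_sampling_verify) are hypothetical
# stand-ins, not AtSpeed's actual API.
import math
import random
from typing import Dict, List, Tuple

Beam = Tuple[int, ...]  # a candidate item: a sequence of token ids


def strict_topk_verify(drafted: Dict[Beam, float],
                       target_topk: Dict[Beam, float]) -> bool:
    """Strict top-K verification: the drafted step is accepted only if every
    top-K beam of the target LLM was also proposed by the draft model;
    otherwise the target LLM must decode this step itself."""
    return set(target_topk).issubset(drafted)


def relaxed_sampling_verify(drafted: Dict[Beam, float],
                            target_logp: Dict[Beam, float],
                            k: int) -> List[Beam]:
    """Relaxed sampling verification (sketch): accept each drafted beam with
    probability min(1, p_target / p_draft), so high-probability beams outside
    the target's exact top-K can still be kept, reducing target-LLM calls."""
    kept = []
    for beam, q in drafted.items():                # q: draft log-probability
        p = target_logp.get(beam, float("-inf"))   # p: target log-probability
        if math.log(random.random() + 1e-12) < p - q:  # log-space ratio test
            kept.append(beam)
    return kept[:k]


# Toy usage with made-up log-probabilities for two drafted beams.
draft = {(5, 3): math.log(0.4), (5, 7): math.log(0.3)}
target = {(5, 3): math.log(0.5), (5, 9): math.log(0.2)}
print(strict_topk_verify(draft, target))            # False: (5, 9) was never drafted
print(relaxed_sampling_verify(draft, target, k=2))  # [(5, 3)]: always kept here
```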
Related papers
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to make such refinement efficient at inference time.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- SubZero: Random Subspace Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and converges faster than standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73x to 1.96x, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [47.5772915135952]
Large language models (LLMs) now support extremely long context windows.
The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency.
We propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism.
arXiv Detail & Related papers (2024-06-17T11:05:15Z)
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z)
- Evidence to Generate (E2G): A Single-agent Two-step Prompting for Context Grounded and Retrieval Augmented Reasoning [3.117335706912261]
We introduce Evidence to Generate (E2G), a novel single-agent, two-step prompting framework.
Instead of unverified reasoning claims, E2G focuses exclusively on the thought sequences explicitly mentioned in the context.
E2G achieves remarkable results robustly across a wide range of knowledge-intensive reasoning and generation tasks.
arXiv Detail & Related papers (2024-01-11T09:49:15Z)
- Contrastive Proposal Extension with LSTM Network for Weakly Supervised Object Detection [52.86681130880647]
Weakly supervised object detection (WSOD) has attracted increasing attention because it uses only image-level labels, which greatly reduces annotation costs.
We propose a new method that compares the initial proposals with extended ones to optimize those initial proposals.
Experiments on the PASCAL VOC 2007, VOC 2012, and MS-COCO datasets show that our method achieves state-of-the-art results.
arXiv Detail & Related papers (2021-10-14T16:31:57Z)