Fugu-MT Paper Translation (Abstract): Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

Paper overview: Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

arxiv url: http://arxiv.org/abs/2604.27747v1
Date: Thu, 30 Apr 2026 11:37:08 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.070861
Title: Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation
Authors: Jiaju Chen, Chongming Gao, Chenxiao Fan, Haoyan Liu, Qingpeng Cai, Peng Jiang, Xiangnan He
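Abstract summary: PAD-Rec is a lightweight module that augments the draft model with two complementary signals. Item position embeddings explicitly encode the within-item slot of each token. Step position embeddings encode the draft step, allowing the model to adapt to depth-dependent uncertainty.
Reference score (independently computed attention score): 27.749196490846916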
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token's semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD's speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that augments the draft model with two complementary signals. Item position embeddings explicitly encode the within-item slot of each token, strengthening structural awareness. Step position embeddings encode the draft step, allowing the model to adapt to depth-dependent uncertainty and improve proposal quality. To harmonize these signals with base features, we add simple gates: a learnable coefficient for item slots and a context-driven gate for draft steps. The module is trainable, easy to integrate with standard draft models, and adds negligible inference overhead. Extensive experiments on four real-world datasets show up to 3.1x wall-clock speedup and about 5% average wall-clock speedup gain over strong SD baselines, while largely preserving recommendation quality.
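Illustrative sketch (not from the paper): the abstract describes two mechanisms that a small example may make more concrete. The Python below is a minimal sketch under stated assumptions, not the authors' implementation. `accept_longest_prefix` shows the greedy variant of speculative-decoding verification, where draft tokens are kept up to the first disagreement with the target model. `augment_draft_feature` shows one plausible way to add a slot embedding scaled by a learnable coefficient and a step embedding scaled by a context-driven sigmoid gate. All function names, parameter names, the fixed `tokens_per_item` layout, and the exact gate form are hypothetical; the abstract does not specify them.

```python
import math


def accept_longest_prefix(draft_tokens, target_tokens):
    """Greedy verification for one speculative-decoding round.

    target_tokens[i] is the target model's greedy choice at draft position i;
    it carries one extra entry for the position after the last draft token,
    so len(target_tokens) == len(draft_tokens) + 1.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break  # first disagreement ends the accepted prefix
        accepted.append(tok)
    # The target's own token at the stopping position is always emitted,
    # so every round yields at least one token.
    accepted.append(target_tokens[len(accepted)])
    return accepted


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def augment_draft_feature(base, position, step, item_emb, step_emb,
                          alpha, gate_w, gate_b, tokens_per_item):
    """Add position-aware signals to a draft-model feature vector.

    Assumptions (hypothetical, not confirmed by the abstract):
    - every item spans exactly tokens_per_item tokens, separator included;
    - alpha is a single learnable scalar for the item-slot embedding;
    - the step gate is a sigmoid of a linear function of the base feature.
    """
    slot = position % tokens_per_item  # within-item slot of this token
    gate = sigmoid(sum(w * x for w, x in zip(gate_w, base)) + gate_b)
    return [
        b + alpha * e_item + gate * e_step
        for b, e_item, e_step in zip(base, item_emb[slot], step_emb[step])
    ]
```

For example, with draft tokens `[5, 7, 9]` and target choices `[5, 7, 2, 4]`, the first two draft tokens match, so the round emits `[5, 7, 2]`: two accepted draft tokens plus the target's correction.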
