Fugu-MT Paper Translation (Abstract): VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling

Paper overview: VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling

arxiv url: http://arxiv.org/abs/2508.17125v1
Date: Sat, 23 Aug 2025 19:58:18 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.351571
Title: VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling
Authors: Kaiyuan Li, Yongxiang Tang, Yanhua Cheng, Yong Bai, Yanxiang Zeng, Chao Wang, Xialong Liu, Peng Jiang
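Abstract summary: In large-scale recommender systems, ultra-long user behavior sequences encode rich signals of evolving interests. The paper proposes VQL, a context-aware Vector Quantization Attention framework for ultra-long behavior modeling.
Reference score (attention measure computed independently by this site): 12.619238878583703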
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In large-scale recommender systems, ultra-long user behavior sequences encode rich signals of evolving interests. Extending sequence length generally improves accuracy, but directly modeling such sequences in production is infeasible due to latency and memory constraints. Existing solutions fall into two categories: (1) top-k retrieval, which truncates the sequence and may discard most attention mass when L >> k; and (2) encoder-based compression, which preserves coverage but often over-compresses and fails to incorporate key context such as temporal gaps or target-aware signals. Neither class achieves a good balance of low-loss compression, context awareness, and efficiency. We propose VQL, a context-aware Vector Quantization Attention framework for ultra-long behavior modeling, with three innovations. (1) Key-only quantization: only attention keys are quantized, while values remain intact; we prove that softmax normalization yields an error bound independent of sequence length, and a codebook loss directly supervises quantization quality. This also enables L-free inference via offline caches. (2) Multi-scale quantization: attention heads are partitioned into groups, each with its own small codebook, which reduces quantization error while keeping cache size fixed. (3) Efficient context injection: static features (e.g., item category, modality) are directly integrated, and relative position is modeled via a separable temporal kernel. All context is injected without enlarging the codebook, so cached representations remain query-independent. Experiments on three large-scale datasets (KuaiRand-1K, KuaiRec, TMALL) show that VQL consistently outperforms strong baselines, achieving higher accuracy while reducing inference latency, establishing a new state of the art in balancing accuracy and efficiency for ultra-long sequence recommendation.
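
The key-only quantization idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a fixed, randomly initialized codebook, nearest-neighbor assignment in Euclidean distance, and a single attention head, and every function name (`nearest_code`, `build_cache`, `cached_attention`, `reference_attention`) is hypothetical. It only shows why quantizing keys while leaving values intact lets the per-codeword value sums and counts be cached offline, so the online cost depends on the codebook size rather than on the sequence length L.

```python
import math
import random


def nearest_code(key, codebook):
    """Index of the codeword closest to `key` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda c: sum((k - w) ** 2 for k, w in zip(key, codebook[c])),
    )


def build_cache(keys, values, codebook):
    """Offline step: per codeword, count assigned keys and sum their unquantized values."""
    dim_v = len(values[0])
    counts = [0] * len(codebook)
    value_sums = [[0.0] * dim_v for _ in codebook]
    for key, value in zip(keys, values):
        c = nearest_code(key, codebook)
        counts[c] += 1
        for j, v in enumerate(value):
            value_sums[c][j] += v
    return counts, value_sums


def cached_attention(query, codebook, counts, value_sums):
    """Online step: softmax attention over codewords only (no loop over the sequence)."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * w for q, w in zip(query, code)) for code in codebook]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    denom = sum(n * w for n, w in zip(counts, weights))
    dim_v = len(value_sums[0])
    return [
        sum(w * s[j] for w, s in zip(weights, value_sums)) / denom
        for j in range(dim_v)
    ]


def reference_attention(query, keys, values):
    """Plain softmax attention over the full sequence, used only for comparison."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    denom = sum(weights)
    dim_v = len(values[0])
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / denom
        for j in range(dim_v)
    ]


# Toy data: sizes and the Gaussian initialization are arbitrary choices for illustration.
random.seed(0)
dim, seq_len, num_codes = 4, 200, 8
codebook = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
query = [random.gauss(0, 1) for _ in range(dim)]

counts, value_sums = build_cache(keys, values, codebook)
fast = cached_attention(query, codebook, counts, value_sums)

# Attention over the full sequence with each key replaced by its codeword.
quantized_keys = [codebook[nearest_code(k, codebook)] for k in keys]
slow = reference_attention(query, quantized_keys, values)
```

Here `fast` and `slow` agree up to floating-point error, because grouping the softmax numerator and denominator by codeword is an exact rearrangement once the keys are quantized. The approximation error relative to unquantized attention comes only from the key quantization itself, which is the quantity the abstract says is bounded independently of sequence length.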
