Fugu-MT Paper Translation (Abstract): Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Paper abstract: Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

arxiv url: http://arxiv.org/abs/2604.02324v1
Date: Thu, 02 Apr 2026 17:59:19 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.991721
Title: Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak
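Abstract summary: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. This paper proposes the Grounded Token Initialization Hypothesis: linguistically grounding new tokens in the pretrained embedding space before fine-tuning.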
Reference score (independently computed attention measure): 15.12832019023085
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
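The collapse described in the abstract can be illustrated with a small toy sketch. This is not the paper's GTI procedure or its diagnostics: the embedding table is random, the sizes `VOCAB`, `DIM`, and `NEW` are arbitrary, and the `descriptions` token ids are hypothetical. The alternative initialization shown here (averaging the embeddings of tokens in a paired text description) is a simple stand-in chosen for illustration only. It shows that mean initialization gives every new token the same vector, so the new rows have essentially zero variance and a pairwise cosine similarity of 1.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def total_variance(rows):
    """Sum of per-dimension variances across rows (trace of the covariance)."""
    mu = mean_vector(rows)
    squared = sum((r[j] - mu[j]) ** 2 for r in rows for j in range(len(mu)))
    return squared / len(rows)


random.seed(0)
VOCAB, DIM, NEW = 50, 8, 4  # arbitrary toy sizes (assumption)

# Toy stand-in for a pretrained embedding table (random values, assumption).
pretrained = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]

# Mean initialization: every new token receives the same vector.
global_mean = mean_vector(pretrained)
mean_init = [list(global_mean) for _ in range(NEW)]

# Illustrative alternative, NOT the paper's GTI stage: average the embeddings
# of the tokens in a hypothetical paired text description of each new token.
descriptions = [[1, 5, 9], [2, 7, 30], [11, 12, 40], [3, 21, 44]]
desc_init = [mean_vector([pretrained[i] for i in ids]) for ids in descriptions]

mean_init_variance = total_variance(mean_init)
desc_init_variance = total_variance(desc_init)
mean_init_cosine = cosine(mean_init[0], mean_init[1])

print("variance across new tokens, mean init:", mean_init_variance)
print("variance across new tokens, description init:", desc_init_variance)
print("cosine between two mean-initialized tokens:", mean_init_cosine)
```

Because the mean-initialized rows are identical, any gradient signal must separate them from scratch during fine-tuning, whereas the description-based rows already start at distinct locations inside the span of the pretrained embeddings.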


List of related papers