Fugu-MT paper translation (abstract): Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Paper abstract: Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

arxiv url: http://arxiv.org/abs/2508.15390v1
Date: Thu, 21 Aug 2025 09:26:48 GMT
Status: translation complete
Last updated in system: 2025-08-22 16:26:46.263838
Title: Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Title (reference translation): Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Authors: Woojin Chung, Jeonghoon Kim
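Abstract summary: Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced. Recent practice favors larger vocabularies, but the source of the benefit is unclear. We conduct a controlled study that scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed.
Reference score (independently computed attention score): 3.7752830020595787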
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but the source of the benefit is unclear. We conduct a controlled study that scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed. We first quantify the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already a single token, so further growth mainly deepens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Constraining input and output embedding norms to attenuate the effect of token-frequency imbalance reverses the gain, directly showing that the model exploits rather than suffers from imbalance. Because the same frequent words cover roughly 77% of tokens in downstream benchmarks, this training advantage transfers intact. We also show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results reframe "bigger vocabularies help" as "lowering the complexity of tokenized text helps," providing a simple, principled lever for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.
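Abstract (reference translation): Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced. Recent practice favors larger vocabularies, but the source of the benefit is unclear. We conduct a controlled study that scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed. We first quantify the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already a single token, so further growth mainly strengthens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Constraining input and output embedding norms to attenuate the effect of token-frequency imbalance reverses the gain, showing that the model exploits rather than suffers from the imbalance. Because the same frequent words cover roughly 77% of tokens in downstream benchmarks, this training advantage transfers intact. We also show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results reframe "bigger vocabularies help" as "lowering the complexity of tokenized text helps," providing a simple, principled lever for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.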
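
The word-level loss decomposition mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names (`frequent_word_set`, `decompose_loss`, `coverage`), the input format (one `(word, loss)` pair per word, with the loss summed over that word's tokens), and the toy numbers are all assumptions made for illustration. The paper's threshold is the 2,500 most frequent words; the toy corpus below uses `top_k=1` only so the arithmetic is easy to follow, and coverage is computed over words as a simple proxy for the token-level coverage the abstract reports.

```python
from collections import Counter


def frequent_word_set(corpus_words, top_k):
    """Return the set of the top_k most frequent words in corpus_words."""
    counts = Counter(corpus_words)
    return {word for word, _ in counts.most_common(top_k)}


def decompose_loss(word_losses, frequent):
    """Split the mean per-word loss into frequent-word and rare-word parts.

    word_losses: list of (word, loss) pairs (hypothetical input format),
                 where loss is the negative log-likelihood of that word.
    Returns a dict whose "frequent" and "rare" entries sum to "mean".
    """
    n = len(word_losses)
    total = sum(loss for _, loss in word_losses)
    frequent_sum = sum(loss for word, loss in word_losses if word in frequent)
    rare_sum = total - frequent_sum
    return {
        "mean": total / n,
        "frequent": frequent_sum / n,
        "rare": rare_sum / n,
    }


def coverage(words, frequent):
    """Fraction of positions in `words` occupied by a frequent word."""
    return sum(word in frequent for word in words) / len(words)


# Toy data (made up, not from the paper).
words = ["the", "cat", "the", "dog", "the", "cat"]
losses = [1.0, 3.0, 1.0, 5.0, 1.0, 3.0]

frequent = frequent_word_set(words, top_k=1)
parts = decompose_loss(list(zip(words, losses)), frequent)
share = coverage(words, frequent)

print(frequent, parts, share)
```

Comparing the `frequent` and `rare` entries between two models trained with different vocabulary sizes would show which part of the distribution accounts for a change in mean loss, which is the kind of comparison the abstract describes.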
