Fugu-MT paper translation (abstract): Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

Paper abstract: Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

arxiv url: http://arxiv.org/abs/2605.13485v1
Date: Wed, 13 May 2026 13:08:08 GMT
Status: translation complete
Last updated in system: 2026-05-14 23:30:28.061904
Title: Effective Context in Transformers: An Analysis of Fragmentation and Tokenization
Authors: Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten
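Abstract summary: The paper examines how the choice of representation changes what a finite-context predictor can achieve. It shows that moving to smaller representation units can hurt prediction even when the context window is enlarged, and that tokenization can make a short token window behave like a longer source-context window.
Reference score (independently computed attention score): 4.364999214109123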
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization -- BPE, WordPiece, and related methods -- which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.
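As a rough illustration of the fragmentation effect described in the abstract, the following Python sketch estimates finite-context log-loss on a toy source and on its bit-level recoding. Everything in it is an assumption made for illustration and is not taken from the paper: the 4-symbol source (each symbol is a high bit and a low bit, where the low bit copies the previous symbol's high bit with probability `p_copy`), the function names, and the plug-in (empirical conditional entropy) estimator, which stands in for the optimal finite-context predictor. The bit-level predictor is also assumed to see only the last `window` bits, with no position information.

```python
import math
import random
from collections import Counter


def sample_source(n, p_copy=0.9, seed=0):
    """Toy first-order Markov source over {0, 1, 2, 3} (hypothetical example).

    Each symbol is 2*high + low. The high bit is uniform; the low bit
    copies the previous symbol's high bit with probability p_copy.
    """
    rng = random.Random(seed)
    symbols = []
    prev_high = 0
    for _ in range(n):
        high = rng.getrandbits(1)
        low = prev_high if rng.random() < p_copy else 1 - prev_high
        symbols.append(2 * high + low)
        prev_high = high
    return symbols


def fragment(symbols):
    """Lossless recoding: replace each symbol by its two bits (high, low)."""
    bits = []
    for s in symbols:
        bits.extend((s >> 1, s & 1))
    return bits


def empirical_context_loss(seq, window):
    """Plug-in estimate, in bits per unit, of the log-loss of the best
    predictor that sees only the previous `window` units."""
    ctx_counts = Counter()
    joint_counts = Counter()
    for i in range(window, len(seq)):
        ctx = tuple(seq[i - window:i])
        ctx_counts[ctx] += 1
        joint_counts[(ctx, seq[i])] += 1
    total = sum(ctx_counts.values())
    loss = 0.0
    for (ctx, _unit), count in joint_counts.items():
        loss -= count * math.log2(count / ctx_counts[ctx])
    return loss / total


symbols = sample_source(50_000)
bits = fragment(symbols)

# Symbol-level predictor with a context of one source symbol.
source_loss = empirical_context_loss(symbols, 1)

# Bit-level predictors; multiply by 2 to express loss per source symbol.
frag_losses = {w: 2 * empirical_context_loss(bits, w) for w in (2, 3, 4)}

print(f"source, 1-symbol context: {source_loss:.3f} bits/symbol")
for w, loss in frag_losses.items():
    print(f"fragmented, {w}-bit context: {loss:.3f} bits/symbol")
```

In this toy setup a 2-bit window is as long as one source symbol, yet when the low bit is being predicted it does not contain the previous symbol's high bit, so the estimated loss per source symbol should come out higher than for the symbol-level predictor. The wider windows are included only to show how the estimate changes as the bit window grows under the no-position assumption.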
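The diagnostic mentioned in the abstract (how much source context a fixed token window reliably contains) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the greedy longest-match tokenizer, the tiny vocabulary, the example text, and the function names `greedy_tokenize`, `window_spans`, and `coverage` are all hypothetical, and "reliably" is interpreted here as the fraction of token windows that span at least `needed` source characters.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization (toy stand-in for a real tokenizer).

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character if nothing matches.
    """
    max_len = max(len(v) for v in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


def window_spans(tokens, k):
    """Number of source characters covered by each window of k tokens."""
    lengths = [len(t) for t in tokens]
    return [sum(lengths[i:i + k]) for i in range(len(lengths) - k + 1)]


def coverage(tokens, k, needed):
    """Fraction of k-token windows spanning at least `needed` characters."""
    spans = window_spans(tokens, k)
    return sum(s >= needed for s in spans) / len(spans)


# Hypothetical vocabulary and text, chosen only for illustration.
vocab = {"ab", "abab", "c"}
text = "abababcabab"

tokens = greedy_tokenize(text, vocab)
spans = window_spans(tokens, k=2)
compression = len(text) / len(tokens)  # source characters per token

print(tokens)
print(spans)
print(coverage(tokens, k=2, needed=4))
print(compression)
```

On a real corpus the same two quantities, the span statistics of a fixed token window and the characters-per-token compression rate, would be computed from the output of the tokenizer under study rather than from this toy longest-match rule.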
