Fugu-MT Paper Translation (Abstract): Faster Superword Tokenization

Paper overview: Faster Superword Tokenization

arxiv url: http://arxiv.org/abs/2604.05192v1
Date: Mon, 06 Apr 2026 21:43:48 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.503759
Title: Faster Superword Tokenization
Authors: Craig W. Schmidt, Chris Tanner, Yuval Pinter
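Abstract summary: The paper presents a two-phase formulation of BoundlessBPE that separates first-phase learning of regular merges from second-phase learning of supermerges. For each of BoundlessBPE, SuperBPE, and BPE, the authors open-source both a reference Python implementation and a fast Rust implementation.
Reference score (attention score computed independently by this site): 10.08525888469663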
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend and improve BPE by relaxing this limitation and allowing the formation of superwords, which are combinations of pretokens that form phrases. However, previous implementations were impractical to train: for example, BoundlessBPE took 4.7 CPU days to train on 1GB of data. We show that supermerge candidates, two or more consecutive pretokens eligible to form a supermerge, can be aggregated by frequency much like regular pretokens. This avoids keeping full documents in memory, as the original implementations of BoundlessBPE and SuperBPE required, leading to a significant training speedup. We present a two-phase formulation of BoundlessBPE that separates first-phase learning of regular merges from second-phase learning of supermerges, producing identical results to the original implementation. We also show a near-equivalence between two-phase BoundlessBPE and SuperBPE, with the difference being that a manually selected hyperparameter used in SuperBPE can be automatically determined in the second phase of BoundlessBPE. These changes enable a much faster implementation, allowing training on that same 1GB of data in 603 and 593 seconds for BoundlessBPE and SuperBPE, respectively, a more than 600x increase in speed. For each of BoundlessBPE, SuperBPE, and BPE, we open-source both a reference Python implementation and a fast Rust implementation.
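The aggregation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pre-tokenization regex `PRETOKEN_RE`, the eligibility rule `ELIGIBLE_RE` (letters only, with an optional leading space), and the function names are all assumptions made for illustration. The sketch streams over documents, extracts each maximal run of two or more consecutive eligible pretokens as a supermerge candidate, and keeps only a frequency table, so no full document has to stay in memory.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical pre-tokenizer (an assumption, not the paper's regex):
# a word with an optional leading space, a run of other symbols, or whitespace.
PRETOKEN_RE = re.compile(r" ?[A-Za-z]+| ?[^A-Za-z\s]+|\s+")

# Hypothetical eligibility rule: only letter-only pretokens may join a superword.
ELIGIBLE_RE = re.compile(r" ?[A-Za-z]+")


def supermerge_candidates(text: str) -> Iterator[Tuple[str, ...]]:
    """Yield each maximal run of two or more consecutive eligible pretokens."""
    run = []
    for pretoken in PRETOKEN_RE.findall(text):
        if ELIGIBLE_RE.fullmatch(pretoken):
            run.append(pretoken)
            continue
        # An ineligible pretoken (punctuation, whitespace) ends the current run.
        if len(run) >= 2:
            yield tuple(run)
        run = []
    if len(run) >= 2:
        yield tuple(run)


def aggregate(docs: Iterable[str]) -> Counter:
    """Count candidates by frequency; each document is dropped after one pass."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(supermerge_candidates(doc))
    return counts


if __name__ == "__main__":
    table = aggregate(["by the way, yes", "by the way. no"])
    for candidate, freq in table.most_common():
        print(freq, candidate)
```

Under these assumptions, a second training phase could work on the frequency table alone, in the same way that ordinary BPE training works on pretoken counts rather than on raw text.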

Related papers