Fugu-MT Paper Translation (Abstract): Separate Before You Compress: The WWHO Tokenization Architecture

Paper summary: Separate Before You Compress: The WWHO Tokenization Architecture

arxiv url: http://arxiv.org/abs/2603.25309v1
Date: Thu, 26 Mar 2026 10:56:48 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.251091
Title: Separate Before You Compress: The WWHO Tokenization Architecture
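Title (reference translation): Separate Before Compressing: The WWHO Tokenization Architecture
Authors: Kusal Darshana
Abstract summary: Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers. We propose an architecture called WWHO (Where-What-How Often) and an algorithm called SGPE (Syllable-aware Grapheme Pair Encoding). Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated it on a 1,499,950-sentence test set.
Reference score (independently computed attention score): 0.0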
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simple structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, the WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI's o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
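The metrics quoted in the abstract can be related to one another with a few lines of arithmetic. The sketch below is illustrative only: the abstract does not give formal definitions, so the formulas (tokens per word, characters per token, relative token reduction, and the context multiplier implied by a reduction) are assumed here, and all function names are hypothetical.

```python
def token_word_ratio(n_tokens: int, n_words: int) -> float:
    """TWR, assumed here to be tokens emitted per whitespace-delimited word."""
    return n_tokens / n_words


def chars_per_token(n_chars: int, n_tokens: int) -> float:
    """Average number of characters covered by one token."""
    return n_chars / n_tokens


def token_reduction(baseline_tokens: int, new_tokens: int) -> float:
    """Fraction of tokens saved relative to a baseline tokenizer on the same text."""
    return 1.0 - new_tokens / baseline_tokens


def context_multiplier(reduction: float) -> float:
    """How much more text fits in a fixed token budget (assumed: 1 / (1 - reduction))."""
    return 1.0 / (1.0 - reduction)


def reduction_for_multiplier(multiplier: float) -> float:
    """Inverse of context_multiplier: the reduction that yields a given multiplier."""
    return 1.0 - 1.0 / multiplier


if __name__ == "__main__":
    # Figures taken from the abstract; the formulas applied to them are assumptions.
    print(f"61.7% fewer tokens -> {context_multiplier(0.617):.2f}x more text per window")
    print(f"4.38x more text    -> {reduction_for_multiplier(4.38):.1%} fewer tokens")
```

Under these assumed definitions, the 61.7 percent reduction reported for Sinhala against o200k base corresponds to roughly 2.6 times more text per context window, while a 4.38-fold extension would require a reduction of about 77 percent. The abstract does not state which baseline or subset produces the 4.38 figure.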
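Abstract (reference translation): Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simply structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts because of their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This lowers the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time, raises inference costs, and results in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding). Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated it on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, a 61.7 percent reduction in tokens compared to OpenAI's o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, with token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This extends the usable context window by up to 4.38 times for these Abugida languages while providing a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.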
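The idea of separating script rules from statistical compression can be illustrated with a small pre-segmentation step. The sketch below is not the paper's SGPE algorithm, and the abstract does not specify its syllable rules. It is a simplified, assumed rule set: combining marks attach to the preceding letter, and a letter joins the preceding cluster after a Devanagari virama or a zero-width joiner. The names `split_clusters` and `breaks_cluster` are hypothetical. A merge procedure that only ever combines whole clusters could not place a token boundary inside one, which is the property the second function checks.

```python
import unicodedata
from itertools import accumulate

ZWJ = "\u200D"
# Assumed joiners: Devanagari virama (U+094D) and ZWJ (used in Sinhala conjuncts).
JOINERS = {"\u094D", ZWJ}


def split_clusters(text: str) -> list[str]:
    """Split text into simplified orthographic clusters (illustrative rules only)."""
    clusters: list[str] = []
    for ch in text:
        prev = clusters[-1][-1] if clusters else ""
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")  # vowel signs, virama, anusvara, ...
        joins_conjunct = category == "Lo" and prev in JOINERS
        if clusters and (is_mark or ch == ZWJ or joins_conjunct):
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def breaks_cluster(text: str, tokens: list[str]) -> bool:
    """Return True if any token boundary falls inside a cluster of `text`."""
    if "".join(tokens) != text:
        raise ValueError("tokens must concatenate to the original text")
    cluster_ends = set(accumulate(len(c) for c in split_clusters(text)))
    token_ends = set(accumulate(len(t) for t in tokens))
    return not token_ends <= cluster_ends


if __name__ == "__main__":
    # Devanagari word written with escapes: KA VIRAMA SSA TA VIRAMA RA I YA
    word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
    print(split_clusters(word))
    print(breaks_cluster(word, [word[:3], word[3:]]))  # boundary between clusters
    print(breaks_cluster(word, [word[:1], word[1:]]))  # boundary inside a conjunct
```

A production segmenter would need script-specific rules well beyond this; the Unicode grapheme cluster rules are one standard starting point.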

Related papers