Fugu-MT paper translation (abstract): The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs

Paper overview: The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs

arxiv url: http://arxiv.org/abs/2606.24460v1
Date: Tue, 23 Jun 2026 11:47:03 GMT
Status: translation complete
Last updated in system: 2026-06-24 22:16:48.932794
Title: The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs
Authors: Olaoye Anthony Somide
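Abstract summary: Tokenizers assign more subword tokens to some languages than to others, so speakers of languages with high token fertility pay a structural penalty before a model is ever invoked. This penalty is documented for multilingual settings in general, but it has not been measured systematically for African languages. We measure it across 20 African languages spanning five language families and three scripts.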
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Commercial large language models bill, scale latency, and budget context per token. Yet tokenizers assign more subword tokens to the same meaning in some languages than in others, so speakers of languages with high token-fertility pay a structural penalty before a model is ever invoked. This penalty is documented for multilingual settings in general, but it has not been measured systematically for African languages at the level of enterprise deployment economics and cognitive context capacity. We measure it across 20 African languages spanning five language families and three scripts (Latin, Ge'ez/Ethiopic, N'Ko; 19 appear in the primary FLORES-200+ corpus, with Nigerian Pidgin measured via MAFAND-MT only), using parallel corpora so that the language effect is isolated from content. Across 11 frontier and open tokenizers on FLORES-200+, every African language carries a tokenization premium above English (median 1.88x on GPT-5 / o200k_base, up to 8.92x for N'Ko); the penalty is largest for Ethiopic and N'Ko scripts (reaching 7-9x) and is near-invariant across corpora (FLORES vs SIB-200 Pearson r = 0.9998). Translated into deployment terms, this results in up to 8.9x inference cost and an equivalent generation-latency multiplier (N'Ko vs English on GPT-5; 7.4x for Amharic), and as little as 11% of English's effective context window. The best currently available tokenizer for African languages, Gemma 4, reduces the mean premium from 3.31x (cl100k_base) to 2.38x, but no tokenizer eliminates the penalty. We release an open measurement tool (afri-fertility), a public leaderboard, a results dataset, and mitigation guidance for African builders. The penalty falls hardest on the languages whose speakers can least afford it, a digital divide encoded directly into the subword vocabulary.
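The deployment figures in the abstract follow from simple arithmetic on the tokenization premium. The sketch below is illustrative only and is not the afri-fertility tool: it assumes that cost and generation latency scale linearly with token count and that the effective context share is the reciprocal of the premium. The names `DeploymentPenalty`, `premium_from_counts`, and `penalty_from_premium` are hypothetical, and the token counts in the usage lines are made-up placeholders chosen to reproduce the reported 8.92x ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentPenalty:
    premium: float           # tokens(language) / tokens(English) on parallel text
    cost_multiplier: float   # assumed equal to the premium (per-token billing)
    context_fraction: float  # assumed share of English's effective context window


def premium_from_counts(lang_tokens: int, english_tokens: int) -> float:
    """Tokenization premium of a language relative to English on the same content."""
    if lang_tokens <= 0 or english_tokens <= 0:
        raise ValueError("token counts must be positive")
    return lang_tokens / english_tokens


def penalty_from_premium(premium: float) -> DeploymentPenalty:
    """Translate a premium into cost and context terms under the linear assumptions."""
    if premium <= 0:
        raise ValueError("premium must be positive")
    return DeploymentPenalty(
        premium=premium,
        cost_multiplier=premium,
        context_fraction=1.0 / premium,
    )


# Placeholder counts (not measured data) that give the 8.92x premium reported for N'Ko.
nko = penalty_from_premium(premium_from_counts(892, 100))
# The median premium reported for GPT-5 / o200k_base.
median = penalty_from_premium(1.88)

print(f"N'Ko:   {nko.cost_multiplier:.2f}x cost, {nko.context_fraction:.0%} of English context")
print(f"Median: {median.cost_multiplier:.2f}x cost, {median.context_fraction:.0%} of English context")
```

Under these assumptions an 8.92x premium leaves roughly 1/8.92, or about 11%, of English's effective context window, which matches the figure quoted in the abstract.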
