Fugu-MT paper translation (abstract): The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

Paper summary: The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

arxiv url: http://arxiv.org/abs/2508.06533v1
Date: Sun, 03 Aug 2025 15:31:10 GMT
Status: Translation complete
Last updated in system: 2025-08-12 21:23:28.414793
Title: The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Title (reference translation): The Art of Breaking Words: Rethinking the Design of Multilingual Tokenizers
Authors: Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Piyush Sawarkar, Viraj Thakur, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan
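Abstract summary: Existing tokenizers exhibit high token-to-word ratios, inefficient use of context length, and slow inference. This paper presents a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. Our tokenizer improves the average token-to-word ratio by more than 40% against state-of-the-art multilingual Indic models.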
Reference score (independently computed attention score): 21.9940001977516
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pretokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement on average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization, alongside architecture and training objectives, as a critical lever for building efficient, scalable multilingual LLMs.
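Abstract (reference translation): While model architecture and training objectives are well studied, tokenization, particularly in multilingual contexts, is a relatively neglected aspect of large language model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slow inference. This paper presents a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground the analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which pose unique challenges because of their high script diversity and orthographic complexity. Based on the insights from these analyses, we propose a new data composition algorithm that balances multilingual data for tokenizer training. The data composition algorithm reduces the average token-to-word ratio by approximately 6% compared with the conventional data randomization approach. Our tokenizer improves the average token-to-word ratio by more than 40% against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization, alongside architecture and training objectives, as a critical lever for building efficient, scalable multilingual LLMs.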
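The token-to-word ratio that the abstract reports can be illustrated with a minimal sketch. It assumes that "words" are whitespace-delimited and that the ratio is total tokens divided by total words over a corpus; the abstract does not state the paper's exact definition. The function names and the stand-in tokenizer `char_bigram_tokenize` are hypothetical and are not the authors' method.

```python
from typing import Callable, Iterable, List


def token_to_word_ratio(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> float:
    """Total tokens divided by total whitespace-delimited words (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline ratio must be non-zero")
    return (baseline - improved) / baseline


def char_bigram_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits each word into 2-character chunks."""
    return [word[i:i + 2] for word in text.split() for i in range(0, len(word), 2)]


if __name__ == "__main__":
    corpus = ["abcd ef", "g"]  # illustrative strings, not data from the paper
    ratio = token_to_word_ratio(corpus, char_bigram_tokenize)
    print(f"token-to-word ratio: {ratio:.3f}")
```

A lower ratio means fewer tokens per word, so more text fits in a fixed context length. Any real tokenizer can be plugged in by passing a function that maps a string to a list of tokens; `relative_reduction` then compares two tokenizers' ratios in the same way the abstract's percentage figures compare approaches.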

Related papers