Fugu-MT paper translation (abstract): Bolmo: Byteifying the Next Generation of Language Models

Paper overview: Bolmo: Byteifying the Next Generation of Language Models

arxiv url: http://arxiv.org/abs/2512.15586v1
Date: Wed, 17 Dec 2025 16:46:11 GMT
Status: Translation complete
Last updated in system: 2025-12-18 17:06:27.063468
Title: Bolmo: Byteifying the Next Generation of Language Models
Title (reference translation): Bolmo: Byteifying the Next Generation of Language Models
Authors: Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann
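Abstract summary: We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs). Byteification overcomes the limitations of subword tokenization. We show that Bolmo can achieve inference speeds competitive with subword-level LMs.
Reference score (independently computed attention score): 115.32940292418463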
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
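Abstract (reference translation): We introduce Bolmo, the first family of competitive, fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to earlier research on byte-level LMs, which focused mainly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification overcomes the limitations of subword tokenization, such as insufficient character understanding and efficiency constraints caused by the fixed subword vocabulary, while performing at the level of leading subword-level LMs. Bolmo is designed specifically for byteification: our architecture resolves a mismatch in expressivity between prior byte-level architectures and subword-level LMs, which makes an effective exact distillation objective between Bolmo and the source subword model possible. As a result, a subword-level LM can be converted to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, outperforms the source subword-level LMs on character understanding and, in some cases, coding, and comes close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and that it can be post-trained cheaply and effectively by leveraging the existing ecosystem around the source subword-level LM. These results make byte-level LMs a practical choice that is competitive with subword-level LMs across a wide range of use cases.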
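
The two quantities the abstract leans on, byte-level input and the token compression ratio, can be illustrated with a minimal Python sketch. This is not Bolmo's tokenizer or architecture: the helper names `byte_ids` and `bytes_per_token` are hypothetical, the subword split is chosen by hand, and "compression ratio" is assumed here to mean average UTF-8 bytes per token.

```python
# Illustrative sketch only; nothing here is taken from the Bolmo codebase.

def byte_ids(text: str) -> list[int]:
    """Map text to its UTF-8 byte values (0-255), the raw input units of a byte-level LM."""
    return list(text.encode("utf-8"))


def bytes_per_token(text: str, tokens: list[str]) -> float:
    """Average number of UTF-8 bytes covered by one token (assumed definition of compression ratio)."""
    # Guard against a segmentation that does not reproduce the input text.
    assert "".join(tokens) == text, "tokens must reassemble the text"
    return len(text.encode("utf-8")) / len(tokens)


text = "byte-level models"
ids = byte_ids(text)

# Hypothetical subword segmentation, picked by hand for illustration.
subwords = ["byte", "-", "level", " models"]

print(len(ids))                         # number of byte-level positions
print(bytes_per_token(text, subwords))  # bytes covered per subword token
print(byte_ids("é"))                    # a non-ASCII character spans several bytes
```

Under this assumed definition, a byte-level model that processes one byte per step sees more positions than a subword model on the same text, which is why the abstract ties competitive inference speed to training with higher compression ratios.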

Related papers