Fugu-MT Paper Translation (Abstract): A Family of LLMs Liberated from Static Vocabularies

Paper summary: A Family of LLMs Liberated from Static Vocabularies

arxiv url: http://arxiv.org/abs/2603.15953v1
Date: Mon, 16 Mar 2026 22:07:18 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.01109
Title: A Family of LLMs Liberated from Static Vocabularies
Title (reference translation): A Family of LLMs Liberated from Static Vocabularies
Authors: Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum
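Abstract summary: We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. We show that available pre-trained models can be reused by converting the Llama 3.1 8B and 70B models into the HAT architecture. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained.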
Reference score (independently computed attention score): 23.053922969985738
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.
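Abstract (reference translation): Tokenization is a central component of natural language processing in current large language models (LLMs), allowing models to convert raw text into processable units. Learned tokenizers are widely adopted, but they have notable limitations, including large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and feeds them to the backbone, a classical autoregressive transformer. The decoder cross-attends to the backbone outputs and converts them back into bytes. Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained. The HAT architecture improves text compression by reducing the number of required sequence positions and increases robustness to intra-word variations such as spelling differences. Through pre-training followed by supervised fine-tuning and direct preference optimization in English and German, the models improve on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.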
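
The abstract states that HAT reduces the number of required sequence positions because the backbone operates on word embeddings rather than on bytes. The following is a minimal sketch of that bookkeeping only, not the released model code. It assumes a simple whitespace-based word splitter; the helper names `split_into_words` and `sequence_lengths` are hypothetical, and the splitting rule actually used by the authors is not described in the abstract.

```python
import re


def split_into_words(text: str) -> list[bytes]:
    """Split text into word chunks and return each chunk as UTF-8 bytes.

    ASSUMPTION: a word is a run of non-whitespace characters plus any
    whitespace that follows it. This is an illustrative rule only.
    """
    # "\S+\s*" matches a word with its trailing whitespace;
    # "\s+" catches leading whitespace that precedes the first word.
    chunks = re.findall(r"\S+\s*|\s+", text)
    return [chunk.encode("utf-8") for chunk in chunks]


def sequence_lengths(text: str) -> tuple[int, int]:
    """Return (byte_positions, word_positions) for a piece of text.

    In a HAT-style model, the encoder and decoder work at byte level,
    while the backbone sees one position per word.
    """
    words = split_into_words(text)
    byte_positions = sum(len(word) for word in words)
    word_positions = len(words)
    return byte_positions, word_positions


if __name__ == "__main__":
    sample = "Tokenization is a central component"
    n_bytes, n_words = sequence_lengths(sample)
    # 35 byte positions are grouped into 5 backbone positions.
    print(f"{n_bytes} bytes -> {n_words} word positions")
```

Because the chunks are plain UTF-8 bytes, concatenating them reproduces the input exactly, and any string in any language can be represented without a fixed vocabulary.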

Related papers