Fugu-MT Paper Translation (Abstract): Training Language Models via Neural Cellular Automata

Paper overview: Training Language Models via Neural Cellular Automata

arxiv url: http://arxiv.org/abs/2603.10055v1
Date: Mon, 09 Mar 2026 18:14:26 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.599173
Title: Training Language Models via Neural Cellular Automata
Title (reference translation): Training Language Models with Neural Cellular Automata
Authors: Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal
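Abstract summary: This work proposes using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training large language models. NCA data exhibits rich structure and statistics resembling natural language while being controllable and cheap to generate at scale. Pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x.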
Reference score (independently computed attention score): 8.490841030371453
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs--training on synthetic-then-natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
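Abstract (reference translation): Pre-training is essential for large language models (LLMs), since it is the stage at which most representations and capabilities are acquired. Natural language pre-training has problems, however: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. Is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs, that is, training on synthetic data first and natural language second. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. Pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B natural language tokens from Common Crawl, which uses more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives the transfer, we find that attention layers are the most transferable and that the optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.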
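
The abstract does not describe the NCA architecture, the tokenization, or how complexity is controlled, so the following is only a minimal sketch of the general idea: a small, randomly initialized network acts as a local update rule on a grid of discrete cell states, and successive grid frames are flattened into a token stream. Every name and setting here (`make_rule`, `step`, `generate_tokens`, the 3x3 neighbourhood, the argmax update, the grid size) is a hypothetical choice for illustration, not the authors' method.

```python
import random

def make_rule(num_states, hidden, seed):
    """Random two-layer MLP: 3x3 one-hot neighbourhood -> next-state logits (assumed design)."""
    rng = random.Random(seed)
    n_in = 9 * num_states
    w1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(num_states)]
    return w1, w2

def step(grid, rule, num_states):
    """Update every cell synchronously, with wrap-around borders."""
    w1, w2 = rule
    h, w = len(grid), len(grid[0])
    new = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Positions of the active one-hot inputs for the 3x3 neighbourhood.
            active = []
            k = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    s = grid[(y + dy) % h][(x + dx) % w]
                    active.append(k * num_states + s)
                    k += 1
            # ReLU hidden layer; the input is one-hot, so we only sum active weights.
            hid = [max(0.0, sum(row[i] for i in active)) for row in w1]
            logits = [sum(a * b for a, b in zip(row, hid)) for row in w2]
            # Deterministic update: pick the highest-scoring state (an assumption).
            new[y][x] = max(range(num_states), key=lambda s: logits[s])
    return new

def generate_tokens(height=6, width=6, num_states=8, steps=4, seed=0):
    """Roll out the automaton and flatten each frame row by row into tokens."""
    rng = random.Random(seed)
    rule = make_rule(num_states, hidden=16, seed=seed + 1)
    grid = [[rng.randrange(num_states) for _ in range(width)] for _ in range(height)]
    tokens = []
    for _ in range(steps):
        tokens.extend(s for row in grid for s in row)
        grid = step(grid, rule, num_states)
    return tokens

tokens = generate_tokens()
print(len(tokens), tokens[:12])
```

In this sketch each seed yields a different rule, so many distinct synthetic "languages" can be sampled cheaply. How the paper actually varies and measures NCA complexity is not stated in the abstract.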

Related papers list