Fugu-MT Paper Translation (Abstract): Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

Paper abstract: Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

arxiv url: http://arxiv.org/abs/2605.13225v1
Date: Wed, 13 May 2026 09:17:51 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:27.937754
Title: Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Authors: Paul Jeha, Anastasiia Sedova, Louis Béthune, Skyler Seto, Jes Frellsen, Pierre Ablin, Natalie Schluter
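Abstract summary: This paper describes how to improve language model pre-training in data-constrained regimes, using Arabic as the low-resource target and English as the auxiliary. Mixing boosts performance by an amount equivalent to 2--3$\times$ the unique target data on validation loss and 2--13$\times$ on downstream task accuracy.
Reference score (attention score computed independently by the site): 24.462817377406754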
License: http://creativecommons.org/licenses/by/4.0/
Abstract: For most languages of the world, language model pre-training operates in a data-constrained regime where models must repeat their training data many times, degrading generalization. Two remedies exist: aggressive hyperparameter tuning such as high weight decay, and mixing in data from a high-resource auxiliary language to directly aid the low-resource target. While hyperparameter tuning regularizes the model by shrinking weights to restrict network capacity, auxiliary data mixing uses a tunable mixing ratio to expand the training distribution and diversify the training signal with new knowledge. Both offer a principled way to improve training in a data-constrained domain. We compare these levers systematically across four model scales from 150M to 1.43B parameters, using Arabic as the low-resource target and English as the auxiliary, over approximately 1000 pre-training runs. Three findings emerge. First, mixing yields larger improvements than hyperparameter tuning on both validation loss and downstream task accuracy, and the gap grows with model size. Second, we quantify how much mixing helps: it boosts performance by an amount equivalent to 2--3$\times$ the unique target data on validation loss and 2--13$\times$ on downstream task accuracy, with the gain scaling steeply with model size. Third, this divergence reveals that target-language validation loss systematically underestimates mixing's value. Mixing regularizes by diversifying the training signal and contributes knowledge the repeated target corpus cannot supply; validation loss captures only the first effect. Our practical recommendations are: mix in a high-resource language, prioritize the mixing ratio over hyperparameter tuning, and transfer hyperparameters from a small proxy model via $μ$P.
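To make the idea of a tunable mixing ratio concrete, here is a minimal sketch, not the authors' implementation. It assumes document-level sampling, where each training sample comes from the auxiliary corpus with probability `aux_ratio` and from the target corpus otherwise. The names `mixed_stream`, `target_epochs`, and `aux_ratio` are hypothetical and do not come from the paper.

```python
import random
from typing import Iterator, Sequence


def mixed_stream(
    target_docs: Sequence[str],
    auxiliary_docs: Sequence[str],
    aux_ratio: float,
    num_samples: int,
    seed: int = 0,
) -> Iterator[str]:
    """Yield num_samples documents, each drawn from the auxiliary corpus
    with probability aux_ratio and from the target corpus otherwise.

    Assumption: sampling is per document and with replacement; a real
    pipeline would more likely mix at the token or batch level.
    """
    if not 0.0 <= aux_ratio <= 1.0:
        raise ValueError("aux_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    for _ in range(num_samples):
        source = auxiliary_docs if rng.random() < aux_ratio else target_docs
        yield rng.choice(source)


def target_epochs(num_samples: int, aux_ratio: float, num_target_docs: int) -> float:
    """Expected number of passes over the target corpus for a fixed budget.

    With a fixed sample budget, a larger aux_ratio leaves fewer draws for
    the target corpus, so each target document is repeated less often.
    """
    return num_samples * (1.0 - aux_ratio) / num_target_docs


if __name__ == "__main__":
    arabic = ["ar_doc_1", "ar_doc_2"]            # stand-in for the small target corpus
    english = ["en_doc_1", "en_doc_2", "en_doc_3"]  # stand-in for the auxiliary corpus
    for doc in mixed_stream(arabic, english, aux_ratio=0.5, num_samples=5):
        print(doc)
    print(target_epochs(num_samples=1000, aux_ratio=0.75, num_target_docs=50))
```

In this sketch, `aux_ratio` plays the role of the mixing ratio that the abstract recommends prioritizing over hyperparameter tuning. The helper `target_epochs` only illustrates the arithmetic of how mixing reduces repetition of the target data at a fixed budget; it says nothing about the gains reported in the paper.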
