Fugu-MT Paper Translation (Abstract): Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios

Paper overview: Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios

arxiv url: http://arxiv.org/abs/2510.05133v1
Date: Wed, 01 Oct 2025 03:28:01 GMT
Status: Translation complete
System update date: 2025-10-08 17:57:07.837229
Title: Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios
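Title (reference translation): Characterizing Model Behavior in Synthetic Data Training: An Empirical Study of Scales and Mixing Ratios
Authors: Y. Du, G. Wu, G. Tang, W. Wang, Q. Fan
Abstract summary: This paper compares model performance, calibration, and output characteristics when models are trained at various synthetic-to-external data ratios. Models maintain stable performance with up to 20% synthetic data, but degradation accelerates beyond 30%. Current best practices, such as those employed in STaR and Self-Instruct systems that maintain more than 80% external data, operate well within the safe regime identified by the experiments.
Reference score (independently calculated attention score): 1.631115063641726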
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Synthetic data generated by large language models has become integral to modern NLP training pipelines, from bootstrapping reasoning capabilities to augmenting instruction-following datasets. While recent work demonstrates successful applications maintaining high external data ratios, systematic understanding of how synthetic data proportion affects model behavior across different scales remains limited. This paper presents a controlled empirical study examining model performance, calibration, and output characteristics when trained on varying synthetic-to-external data ratios. Using the Pythia model suite (410M-12B parameters) across five diverse tasks, we evaluate models after one to three training iterations with synthetic data proportions ranging from 0% to 50%. Our key findings include: models maintain stable performance with up to 20% synthetic data, but degradation accelerates beyond 30%; larger models (6.9B-12B) show greater robustness to synthetic data than smaller models (410M-1.4B); calibration degradation precedes accuracy loss, providing an early warning signal; and task characteristics matter, with reasoning tasks degrading faster than retrieval tasks under synthetic data training. Importantly, we find that current best practices, such as those employed in STaR and Self-Instruct systems that maintain greater than 80% external data, operate well within safe regimes identified by our experiments. We provide practical guidance for practitioners on synthetic data budgets based on model scale and task requirements, alongside detailed comparison with concurrent work including Shumailov et al.'s model collapse findings.
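Abstract (reference translation): Synthetic data generated by large language models has become indispensable to modern NLP training pipelines, from bootstrapping reasoning capabilities to extending instruction-following datasets. Recent work shows successful applications that keep the external data ratio high, but systematic understanding of how the synthetic data ratio affects model behavior across different scales is still limited. This paper compares model performance, calibration, and output characteristics when models are trained at various synthetic-to-external data ratios. Using the Pythia model suite (410M-12B parameters) across five tasks, models are evaluated after one to three training iterations with synthetic data proportions from 0% to 50%. Models remain stable with up to 20% synthetic data, but degradation accelerates beyond 30%; larger models (6.9B-12B) are more robust to synthetic data than smaller models (410M-1.4B); calibration degradation precedes accuracy loss and serves as an early warning signal; and task characteristics matter, with reasoning tasks degrading faster than retrieval tasks. Importantly, current best practices, such as those employed in STaR and Self-Instruct systems that maintain more than 80% external data, operate well within the safe regime identified by the experiments. The paper offers practitioners practical guidance on synthetic data budgets based on model scale and task requirements, along with a comparison with concurrent work, including the model collapse findings of Shumailov et al.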
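
To make the two quantities in the abstract concrete, the following is a minimal Python sketch of (a) building a training set with a fixed synthetic-data proportion and (b) measuring calibration. It is an illustration under stated assumptions, not the authors' code: the abstract does not say how the mixtures were sampled or which calibration metric was used. Expected calibration error (ECE) is assumed here as one common choice, and the names `mix_datasets` and `expected_calibration_error` are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(external: Sequence[T], synthetic: Sequence[T],
                 synthetic_fraction: float, total_size: int,
                 seed: int = 0) -> List[T]:
    """Sample `total_size` examples so that `synthetic_fraction` are synthetic.

    Assumption: uniform sampling without replacement from each pool.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    n_synth = round(total_size * synthetic_fraction)
    n_ext = total_size - n_synth
    if n_synth > len(synthetic) or n_ext > len(external):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)
    mixed = rng.sample(list(synthetic), n_synth) + rng.sample(list(external), n_ext)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    Bins are equal-width over [0, 1]; a confidence of exactly 1.0 goes
    into the last bin.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

In a setup like the one described, one might track `expected_calibration_error` alongside accuracy for each mixture, for example `mix_datasets(external, synthetic, 0.2, 10_000)` for a 20% synthetic budget, since the abstract reports that calibration degrades before accuracy does.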

Related papers