Fugu-MT Paper Translation (Abstract): Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set

Paper overview: Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set

arxiv url: http://arxiv.org/abs/2509.06463v2
Date: Tue, 28 Oct 2025 08:59:44 GMT
Status: Translation complete
Last updated in system: 2025-10-29 15:35:36.096536
Title: Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set
Title (reference translation): Accelerating LLM Fine-Tuning by Quantifying the Coverage and Depth of the Instruction Set
Authors: Chengwei Wu, Li Du, Hanyu Zhao, Yiming Ju, Jiapu Wang, Tianyu Chen, Haoyi Zhou
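Abstract summary: Scaling the data used for supervised fine-tuning (SFT) does not guarantee proportional gains in model performance. This work identifies two fundamental dataset properties that govern SFT scalability. It proposes Information Landscape Approximation (ILA), a model-agnostic data selection framework.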
Reference score (independently computed attention measure): 37.26992936545316
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling the amount of data used for supervised fine-tuning (SFT) does not guarantee proportional gains in model performance, highlighting a critical need to understand what makes training samples effective. This work identifies two fundamental dataset properties that govern SFT scalability: semantic coverage, or the breadth of task domains, and information depth, or the richness of individual examples. We demonstrate that simple proxies for these properties explain the majority of validation loss variance in our experiments. We further propose the Information Landscape Approximation (ILA), a model-agnostic data selection framework that jointly optimizes for these two factors. ILA constructs compact subsets that approximate the informational value of large datasets. Empirical results show that models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes compared to existing methods, a phenomenon we term accelerated scaling.
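Abstract (reference translation): Increasing the amount of data used for supervised fine-tuning (SFT) does not guarantee a proportional improvement in model performance. This work identifies two fundamental dataset properties that govern SFT scalability: semantic coverage, or the breadth of task domains, and information depth, or the richness of individual examples. Simple proxies for these properties explain most of the variance in validation loss in the experiments. This work proposes Information Landscape Approximation (ILA), a model-agnostic data selection framework that jointly optimizes these two factors. ILA constructs compact subsets that approximate the informational value of large datasets. Experimental results show that models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes than existing methods.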
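
Illustrative sketch (not the paper's method): the abstract does not say how ILA measures semantic coverage or information depth, or how it combines them. The Python below only shows what "jointly optimizing" the two could look like under loud assumptions: each example carries a hypothetical `domain` label as a coverage proxy and a hypothetical scalar `depth` score as a depth proxy, and `select_subset` is an invented name. It picks round-robin across domains, taking the deepest remaining example from each.

```python
from collections import defaultdict


def select_subset(examples, budget):
    """Greedy round-robin selection over domains, deepest examples first.

    examples: list of (example_id, domain, depth) tuples. Both `domain`
    and `depth` are hypothetical proxies chosen for illustration; they
    are NOT the proxies defined in the paper.
    budget: maximum number of example ids to return.
    """
    # Group examples by domain so that coverage can be spread evenly.
    by_domain = defaultdict(list)
    for example_id, domain, depth in examples:
        by_domain[domain].append((depth, example_id))

    # Inside each domain, rank examples from highest to lowest depth.
    for items in by_domain.values():
        items.sort(reverse=True)

    selected = []
    rank = 0
    while len(selected) < budget:
        added = False
        # Visit domains in sorted order so the result is deterministic.
        for domain in sorted(by_domain):
            items = by_domain[domain]
            if rank < len(items) and len(selected) < budget:
                selected.append(items[rank][1])
                added = True
        if not added:
            break  # every domain is exhausted
        rank += 1
    return selected
```

With a small budget this favors one deep example per domain before taking a second example from any domain, which is one simple way to trade breadth against per-example richness. A real implementation would need the paper's actual proxies and objective.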

