Fugu-MT Paper Translation (Abstract): Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures

Paper summary: Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures

arxiv url: http://arxiv.org/abs/2511.13640v1
Date: Mon, 17 Nov 2025 17:53:12 GMT
Status: Translation complete
System update date: 2025-11-18 18:52:09.647821
Title: Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures
Authors: Haohui Wang, Jingyuan Qi, Jianpeng Chen, Jun Wu, Lifu Huang, Lecheng Zheng, Kevin Choi, Balaji Veeramani, Edward Bowen, Alison Hu, Tyler Cody, Dawei Zhou
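Abstract summary: Large language models (LLMs) are built on datasets that blend real and synthetic data. Synthetic data offers scalability and cost efficiency, but it often introduces systematic distributional discrepancies. This paper proposes an effective data valuation method that scales to large datasets.
Reference score (independently computed attention score): 32.89034139737846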
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid progress of large language models (LLMs) is fueled by a growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, in particular underrepresenting long-tail knowledge because of truncation effects from data generation mechanisms such as top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior between learning head knowledge and learning tail knowledge. We further derive an LLM generalization bound designed for real-synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective and efficient data valuation method that scales to large datasets. Comprehensive experiments across four tasks (image classification, sentiment classification, instruction following, and complex reasoning) demonstrate that our method surpasses state-of-the-art baselines in data valuation at markedly low computational cost.
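The truncation effect mentioned in the abstract can be illustrated with a small sketch. This is not the paper's method or experimental setup; it only shows, under assumed settings, how temperature scaling, top-p truncation, and finite sampling each shrink the set of long-tail items that a generator can emit. The Zipf-like "real" distribution, the vocabulary size of 1000, the temperature of 0.8, the top-p of 0.9, and all function names are hypothetical choices made for illustration.

```python
import random


def zipf_probs(vocab_size, exponent=1.0):
    """Zipf-like distribution over ranks 0..vocab_size-1 (rank 0 is the head). Assumed stand-in for real data."""
    weights = [1.0 / (rank + 1) ** exponent for rank in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def apply_temperature(probs, temperature):
    """Rescale each probability as p_i^(1/T) and renormalize; T < 1 concentrates mass on the head."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def top_p_truncate(probs, top_p):
    """Keep the smallest set of most likely items whose mass reaches top_p, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    truncated = [0.0] * len(probs)
    for i in kept:
        truncated[i] = probs[i] / mass
    return truncated


def support_size(probs):
    """Number of items that can still be generated at all."""
    return sum(1 for p in probs if p > 0.0)


real = zipf_probs(1000)
synthetic = top_p_truncate(apply_temperature(real, temperature=0.8), top_p=0.9)

# Finite sampling: a limited number of draws covers even fewer distinct items.
rng = random.Random(0)
sample = rng.choices(range(len(synthetic)), weights=synthetic, k=5000)
observed = len(set(sample))

print("items with nonzero probability (real):", support_size(real))
print("items with nonzero probability (synthetic):", support_size(synthetic))
print("distinct items seen in 5000 synthetic draws:", observed)
```

In this toy setting the rarest items receive exactly zero probability after truncation, so no amount of additional synthetic sampling can recover them, which is one way to read the abstract's point about underrepresented long-tail knowledge.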
