Fugu-MT Paper Translation (Abstract): Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Paper overview: Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

arxiv url: http://arxiv.org/abs/2505.20161v1
Date: Mon, 26 May 2025 16:05:10 GMT
Status: Translation complete
System update date: 2025-05-27 19:27:26.922327
Title: Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
Title (reference translation): Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
Authors: Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostafa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
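Abstract summary: We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We propose Prismatic Synthesis, a framework for generating diverse synthetic data.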
Reference score (independently computed attention score): 77.120955854093
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models -- and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning -- as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's $\rho \approx 0.9$) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data -- not just on in-distribution test but across unseen, out-of-distribution benchmarks -- significantly outperforming state-of-the-art models that rely on 20 times larger data generator than ours. For example, PrismMath-7B, our model distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B -- the same base model trained on proprietary data generated by 671B R1 -- on 6 out of 7 challenging benchmarks.
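Abstract (reference translation): Effective generalization in language models depends heavily on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. What kind of diversity in training data actually drives generalization in language models, and how can it be measured and amplified? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's $\rho \approx 0.9$) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as synthetic data is scaled, not only on in-distribution tests but also on unseen out-of-distribution benchmarks, significantly outperforming state-of-the-art models that rely on a data generator 20 times larger than ours. For example, PrismMath-7B, distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B (the same base model trained on proprietary data generated by the 671B R1) on 6 out of 7 challenging benchmarks.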
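The abstract describes G-Vendi only as a diversity measure based on the entropy of model-induced gradients and does not give a formula. As a rough illustration, the sketch below assumes a Vendi-Score-style definition: the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix built from per-example gradient vectors. The function name `vendi_score_from_gradients`, the use of cosine similarity, and the input layout (one gradient vector per row) are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def vendi_score_from_gradients(grads: np.ndarray) -> float:
    """Vendi-style diversity score for a set of gradient vectors.

    grads: array of shape (n_examples, dim), one gradient vector per row
    (hypothetical layout; in practice these would come from a proxy model).
    Returns exp(H), where H is the Shannon entropy of the eigenvalues of
    the cosine-similarity matrix divided by n_examples.
    """
    # L2-normalize each gradient so the kernel below is cosine similarity.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)

    # Dividing by n gives the kernel unit trace, so its eigenvalues
    # sum to 1 and can be treated as a probability distribution.
    n = unit.shape[0]
    kernel = unit @ unit.T / n

    # The kernel is symmetric, so eigvalsh applies; drop numerical zeros.
    eigvals = np.linalg.eigvalsh(kernel)
    eigvals = eigvals[eigvals > 1e-12]

    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))
```

Under this assumed definition the score behaves like an "effective number of distinct directions": a dataset whose gradients all point the same way scores about 1, while n mutually orthogonal gradients score about n.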

Related papers