Fugu-MT Paper Translation (Abstract): Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

Paper overview: Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

arxiv url: http://arxiv.org/abs/2605.12906v1
Date: Wed, 13 May 2026 02:33:04 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.764796
Title: Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning
Authors: Siyuan Liu, Tinghong Chen, Xinghan Li, Yifei Wang, Jingzhao Zhang
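Abstract summary: Data selection during supervised fine-tuning can critically change the behavior of large language models (LLMs). This work studies the role of data difficulty in fine-tuning from both empirical and theoretical perspectives. It shows that for a fixed data budget there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases.
Reference score (independently computed attention score): 21.945877611442867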
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.
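The mechanism named in the abstract (a trade-off between the in-distribution generalization gap and the extrapolation gap) can be illustrated with a toy model. The sketch below is not the paper's model: the functional forms, the constants `C`, `K`, `A`, and the function names are all hypothetical choices made for illustration. It assumes the generalization gap shrinks as 1/sqrt(n) and grows with difficulty, while the extrapolation gap to a hard target shrinks as the training data gets harder.

```python
import math

# All functional forms and constants below are illustrative assumptions;
# none of them come from the paper.
C = 5.0  # assumed scale of the generalization gap
K = 4.0  # assumed rate at which that gap grows with difficulty
A = 0.5  # assumed scale of the extrapolation gap


def generalization_gap(d: float, n: int) -> float:
    """Assumed in-distribution gap: shrinks as 1/sqrt(n), grows with difficulty d in [0, 1]."""
    return C * (1.0 + K * d * d) / math.sqrt(n)


def extrapolation_gap(d: float) -> float:
    """Assumed gap to a hard target (difficulty 1): shrinks as training data gets harder."""
    return A * (1.0 - d)


def total_gap(d: float, n: int) -> float:
    """Sum of the two assumed gaps for training difficulty d and data budget n."""
    return generalization_gap(d, n) + extrapolation_gap(d)


def best_difficulty(n: int, steps: int = 200) -> float:
    """Grid-search the difficulty in [0, 1] that minimizes the total gap for budget n."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda d: total_gap(d, n))


if __name__ == "__main__":
    # Print the minimizing difficulty for a few data budgets.
    for n in (100, 400, 1600, 6400):
        print(n, best_difficulty(n))
```

Under these assumptions the minimizer has the closed form min(1, A·sqrt(n) / (2·C·K)), so the best difficulty in the toy model rises with the data budget, which matches the qualitative claim in the abstract. Different assumed gap shapes would give different curves.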
