Fugu-MT Paper Translation (Abstract): Market-Driven Subset Selection for Budgeted Training

Paper summary: Market-Driven Subset Selection for Budgeted Training

arxiv url: http://arxiv.org/abs/2510.02456v2
Date: Mon, 20 Oct 2025 15:38:47 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:38.604474
Title: Market-Driven Subset Selection for Budgeted Training
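Title (reference translation): Market-Driven Subset Selection for Budgeted Training
Authors: Ashish Jha, Valentin Leplat, AH Phan
Abstract summary: We propose a market-based framework that treats each training example as a tradeable contract. On GSM8K mathematical reasoning under a strict 60k-token budget, the selector achieves parity with strong single-signal baselines. The framework unifies multi-signal data curation under fixed computational budgets for prompt-level reasoning and classification tasks.
Reference score (independently computed attention level): 1.7969777786551429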
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training large language models on massive datasets is computationally expensive, yet empirical evidence suggests that substantial portions of training examples contribute minimally to final performance. Data subset selection addresses this inefficiency by identifying small, high-utility subsets under resource constraints. However, example utility is inherently multi-faceted, encompassing uncertainty, distributional rarity, and diversity signals that are heterogeneous and typically combined through ad hoc weighted sums lacking theoretical grounding. We propose a market-based framework that treats each training example as a tradeable contract and employs the Logarithmic Market Scoring Rule to aggregate multiple utility signals into coherent prices. Heterogeneous signals act as traders, a single liquidity parameter controls concentration versus smoothing, and topic-wise normalization ensures calibrated aggregation. Token budgets are handled explicitly through a price-per-token decision rule with an interpretable length-bias parameter. We establish theoretical connections to maximum-entropy aggregation and provide utility recovery guarantees under noisy but monotone signals. On GSM8K mathematical reasoning under strict 60k-token budgets, our selector achieves parity with strong single-signal baselines while exhibiting lower variance and incurring less than 0.1 GPU-hour overhead. On AGNews classification at 5-25% retention rates, the market formulation delivers competitive accuracy with improved stability. Our framework unifies multi-signal data curation under fixed computational budgets for prompt-level reasoning and classification tasks.
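Abstract (reference translation): Training large language models on massive datasets is computationally expensive, yet empirical evidence indicates that a substantial portion of training examples contributes minimally to final performance. Data subset selection addresses this inefficiency by identifying small, high-utility subsets under resource constraints. However, example utility is inherently multi-faceted, encompassing uncertainty, distributional rarity, and diversity signals. We propose a market-based framework that treats each training example as a tradeable contract. Heterogeneous signals act as traders, a single liquidity parameter controls concentration versus smoothing, and topic-wise normalization ensures calibrated aggregation. Token budgets are handled explicitly through a price-per-token decision rule with an interpretable length-bias parameter. We establish theoretical connections to maximum-entropy aggregation and provide utility recovery guarantees under noisy but monotone signals. On GSM8K mathematical reasoning under a strict 60k-token budget, the selector achieves parity with strong single-signal baselines while exhibiting lower variance and incurring less than 0.1 GPU-hour of overhead. On AGNews classification at 5-25% retention rates, the market formulation delivers competitive accuracy with improved stability. The framework unifies multi-signal data curation under fixed computational budgets for prompt-level reasoning and classification tasks.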
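Illustrative sketch (not from the paper): the abstract names the Logarithmic Market Scoring Rule (LMSR), a liquidity parameter, and a price-per-token rule with a length-bias parameter, but it does not state the exact formulas. The Python below is a minimal sketch under explicit assumptions. It uses the standard LMSR price, p_i = exp(q_i / b) / sum_j exp(q_j / b), where b is the liquidity parameter. The way signals become shares q_i (a sum of z-scored signals), the global z-score standing in for topic-wise normalization, the ranking score price / tokens**gamma, and every function and variable name are hypothetical choices made for illustration.

```python
import math


def zscore(values):
    """Standardize one signal so signals on different scales are comparable."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # constant signal: avoid division by zero
    return [(v - mean) / std for v in values]


def lmsr_prices(signals, b=1.0):
    """Standard LMSR prices: p_i = exp(q_i / b) / sum_j exp(q_j / b).

    ASSUMPTION: each signal ("trader") buys shares equal to its z-scored
    value, so q_i is the sum of standardized signals for example i.
    """
    n = len(signals[0])
    q = [0.0] * n
    for signal in signals:
        for i, z in enumerate(zscore(signal)):
            q[i] += z
    q_max = max(q)  # subtract the max for numerical stability
    weights = [math.exp((qi - q_max) / b) for qi in q]
    total = sum(weights)
    return [w / total for w in weights]


def select_under_budget(prices, token_counts, budget, gamma=1.0):
    """Greedy selection by price / tokens**gamma within a token budget.

    ASSUMPTION: gamma is a hypothetical length-bias exponent; gamma=0
    ignores length, gamma=1 ranks by plain price per token.
    """
    order = sorted(
        range(len(prices)),
        key=lambda i: prices[i] / token_counts[i] ** gamma,
        reverse=True,
    )
    chosen, used = [], 0
    for i in order:
        if used + token_counts[i] <= budget:
            chosen.append(i)
            used += token_counts[i]
    return chosen, used


# Toy data: two signals over four examples, with token lengths.
uncertainty = [0.9, 0.2, 0.6, 0.4]
rarity = [0.1, 0.8, 0.7, 0.3]
tokens = [120, 80, 200, 60]

prices = lmsr_prices([uncertainty, rarity], b=1.0)
chosen, used = select_under_budget(prices, tokens, budget=250, gamma=1.0)
print(prices, chosen, used)
```

In this sketch a small b concentrates price mass on the top-ranked examples, while a large b pushes prices toward uniform, which is one way to read the abstract's "concentration versus smoothing".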

