Fugu-MT Paper Translation (Abstract): Computational Budget Should Be Considered in Data Selection

Paper overview: Computational Budget Should Be Considered in Data Selection

arxiv url: http://arxiv.org/abs/2510.16806v2
Date: Sun, 02 Nov 2025 08:01:37 GMT
Status: Translation complete
Last updated in system: 2025-11-05 02:21:43.164872
Title: Computational Budget Should Be Considered in Data Selection
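Title (reference translation): Considering the Computational Budget in Data Selection
Authors: Weilin Wan, Weizhong Zhang, Cheng Jin
Abstract summary: We argue that the compute budget should be integral to data-selection strategies. This paper proposes a novel compute-budget-aware data selection method. The method achieves performance gains of up to 14.42% over baselines on vision and language benchmarks.
Reference score (independently computed attention score): 21.598075666695483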
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data selection improves computational efficiency by choosing informative subsets of training samples. However, existing methods ignore the compute budget, treating data selection and importance evaluation independently of compute budget constraints. Yet empirical studies show no algorithm can consistently outperform others (or even random selection) across varying budgets. We therefore argue that compute budget must be integral to data-selection strategies, since different budgets impose distinct requirements on data quantity, quality, and distribution for effective training. To this end, we propose a novel Computational budget-Aware Data Selection (CADS) method and naturally formulate it into a bilevel optimization framework, where the inner loop trains the model within the constraints of the computational budget on some selected subset of training data, while the outer loop optimizes data selection based on model evaluation. Our technical contributions lie in addressing two main challenges in solving this bilevel optimization problem: the expensive Hessian matrix estimation for outer-loop gradients and the computational burden of achieving inner-loop optimality during iterations. To solve the first issue, we propose a probabilistic reparameterization strategy and compute the gradient using a Hessian-free policy gradient estimator. To address the second challenge, we transform the inner optimization problem into a penalty term in the outer objective, further discovering that we only need to estimate the minimum of a one-dimensional loss to calculate the gradient, significantly improving efficiency. Extensive experiments show that our method achieves performance gains of up to 14.42% over baselines in vision and language benchmarks.
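Abstract (reference translation): Data selection improves computational efficiency by selecting informative subsets of training samples. Existing methods, however, ignore the compute budget and handle data selection and importance evaluation without regard to compute budget constraints. Empirical studies nevertheless show that no algorithm can consistently outperform the others (or even random selection) across different budgets. We argue that the compute budget must be integral to data-selection strategies, because different budgets impose different requirements on data quantity, quality, and distribution. We therefore propose a Computational budget-Aware Data Selection (CADS) method, in which the inner loop trains the model on a selected subset of the training data within the constraints of the compute budget, while the outer loop optimizes the data selection based on model evaluation. Our technical contributions address the two main challenges in solving this bilevel optimization problem: the expensive Hessian matrix estimation for outer-loop gradients, and the computational burden of reaching inner-loop optimality during iterations. To solve the first challenge, we propose a probabilistic reparameterization strategy and compute the gradient with a Hessian-free policy gradient estimator. To address the second, we transform the inner optimization problem into a penalty term in the outer objective and further find that only the minimum of a one-dimensional loss needs to be estimated to compute the gradient, which significantly improves efficiency. Extensive experiments show performance gains of up to 14.42% over baselines on vision and language benchmarks.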
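The abstract describes a probabilistic reparameterization with a Hessian-free policy gradient estimator but gives no formulas. The sketch below shows one standard way such an estimator can be built and is not the authors' CADS implementation. It assumes each training sample is kept independently with probability s_i (a Bernoulli mask), a toy one-parameter regression model, and a fixed number of SGD steps standing in for the compute budget. All function names (`score`, `inner_train`, `val_loss`, `outer_grad`, `exact_objective`) and the data are hypothetical. The penalty reformulation of the inner problem mentioned in the abstract is not modeled here.

```python
import itertools
import random


def score(probs, mask):
    """d/ds_i of log p(mask | probs) for independent Bernoulli(s_i) entries."""
    return [m / p - (1 - m) / (1 - p) for p, m in zip(probs, mask)]


def inner_train(train, mask, budget, lr=0.05):
    """Toy inner loop: `budget` SGD steps fitting y ~ w * x on selected samples."""
    chosen = [pair for pair, m in zip(train, mask) if m]
    w = 0.0
    for step in range(budget if chosen else 0):
        x, y = chosen[step % len(chosen)]
        w -= lr * 2.0 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
    return w


def val_loss(w, val):
    """Mean squared error of the toy model on held-out pairs."""
    return sum((w * x - y) ** 2 for x, y in val) / len(val)


def outer_grad(probs, train, val, budget, n_samples, rng):
    """Monte Carlo score-function estimate of d/ds E_mask[val_loss(w(mask))].

    No Hessian or backpropagation through training is needed: each sample
    contributes loss * grad log p(mask | probs).
    """
    grad = [0.0] * len(probs)
    for _ in range(n_samples):
        mask = [1 if rng.random() < p else 0 for p in probs]
        loss = val_loss(inner_train(train, mask, budget), val)
        for i, g in enumerate(score(probs, mask)):
            grad[i] += loss * g / n_samples
    return grad


def exact_objective(probs, train, val, budget):
    """Exact E_mask[val_loss] by enumerating every mask (tiny n only)."""
    total = 0.0
    for mask in itertools.product([0, 1], repeat=len(probs)):
        weight = 1.0
        for p, m in zip(probs, mask):
            weight *= p if m else 1 - p
        total += weight * val_loss(inner_train(train, mask, budget), val)
    return total


if __name__ == "__main__":
    # Hypothetical data: the third training pair is deliberately mislabeled.
    train = [(1.0, 2.0), (2.0, 4.0), (1.0, -2.0)]
    val = [(1.0, 2.0), (3.0, 6.0)]
    probs = [0.5, 0.5, 0.5]
    grad = outer_grad(probs, train, val, budget=20, n_samples=2000,
                      rng=random.Random(0))
    print(grad)
```

A descent step on the selection probabilities would subtract a multiple of this gradient and clip each s_i to stay strictly inside (0, 1), so that `score` remains finite. Changing `budget` changes what `inner_train` returns for the same mask, which is the sense in which the selection objective in this toy setup depends on the compute budget.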


Related papers