Fugu-MT Paper Translation (Abstract): Understanding Data Influence with Differential Approximation

Paper overview: Understanding Data Influence with Differential Approximation

arxiv url: http://arxiv.org/abs/2508.14648v1
Date: Wed, 20 Aug 2025 11:59:32 GMT
Status: Translation complete
Last updated in system: 2025-08-21 16:52:41.444589
Title: Understanding Data Influence with Differential Approximation
Authors: Haoru Tan, Sitong Wu, Xiuzhe Wu, Wang Wang, Bo Zhao, Zeke Xie, Gui-Song Xia, Xiaojuan Qi
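Abstract summary: We introduce a new formulation, called Diff-In, that approximates a sample's influence by accumulating the differences in influence between consecutive learning steps. By employing second-order approximations, it approximates these difference terms with high accuracy while eliminating the need for the model convexity required by existing methods. Diff-In is shown to achieve significantly lower approximation error than existing influence estimators.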
Reference score (independently computed attention score): 63.817689230826595
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data plays a pivotal role in the groundbreaking advancements in artificial intelligence. The quantitative analysis of data significantly contributes to model training, enhancing both the efficiency and quality of data utilization. However, existing data analysis tools often lag in accuracy. For instance, many of these tools even assume that the loss function of neural networks is convex. These limitations make it challenging to implement current methods effectively. In this paper, we introduce a new formulation to approximate a sample's influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. Specifically, we formulate the sample-wise influence as the cumulative sum of its changes/differences across successive training iterations. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Despite being a second-order method, Diff-In maintains computational complexity comparable to that of first-order methods and remains scalable. This efficiency is achieved by computing the product of the Hessian and gradient, which can be efficiently approximated using finite differences of first-order gradients. We assess the approximation accuracy of Diff-In both theoretically and empirically. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators. Extensive experiments further confirm its superior performance across multiple benchmark datasets in three data-centric tasks: data cleaning, data deletion, and coreset selection. Notably, our experiments on data pruning for large-scale vision-language pre-training show that Diff-In can scale to millions of data points and outperforms strong baselines.
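The abstract states that the Hessian-gradient product can be approximated with finite differences of first-order gradients. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes a central-difference scheme (the paper may use a different one), and the toy loss, the matrix `A`, and the names `hvp_finite_difference`, `grad_fn`, and `exact_hvp` are all hypothetical, chosen only for illustration.

```python
import math

def hvp_finite_difference(grad_fn, theta, v, eps=1e-4):
    """Approximate H(theta) @ v using two gradient evaluations.

    Central difference (an assumption of this sketch):
        H v ~= (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps)
    The Hessian itself is never formed.
    """
    plus = grad_fn([t + eps * d for t, d in zip(theta, v)])
    minus = grad_fn([t - eps * d for t, d in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

# Hypothetical toy loss: f(theta) = sum(cos(theta_i)) + 0.5 * theta^T A theta.
# A is symmetric; the Hessian A - diag(cos(theta)) has a negative diagonal
# entry at the point used below, so the loss is not convex there.
A = [[2.0, 0.5], [0.5, -1.0]]

def grad_fn(theta):
    # Analytic gradient of the toy loss: -sin(theta) + A @ theta.
    return [-math.sin(theta[i]) + sum(A[i][j] * theta[j] for j in range(2))
            for i in range(2)]

def exact_hvp(theta, v):
    # Analytic Hessian-vector product, used only to check the approximation.
    return [-math.cos(theta[i]) * v[i] + sum(A[i][j] * v[j] for j in range(2))
            for i in range(2)]

theta = [0.3, -0.7]
v = grad_fn(theta)  # Hessian-gradient product: the vector is the gradient itself
approx = hvp_finite_difference(grad_fn, theta, v)
exact = exact_hvp(theta, v)
max_abs_error = max(abs(a - e) for a, e in zip(approx, exact))
```

Each product costs two extra gradient evaluations, which is consistent with the abstract's claim that a second-order quantity can be obtained at a cost comparable to first-order methods. The step size `eps` trades truncation error against floating-point rounding error and would need tuning for a real model.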

List of related papers