Fugu-MT paper translation (abstract): When the Past Misleads: Rethinking Training Data Expansion Under Temporal Distribution Shifts

Paper overview: When the Past Misleads: Rethinking Training Data Expansion Under Temporal Distribution Shifts

arxiv url: http://arxiv.org/abs/2509.01060v2
Date: Thu, 04 Sep 2025 17:23:54 GMT
Status: translation complete
Last updated in system: 2025-09-05 14:03:59.090008
Title: When the Past Misleads: Rethinking Training Data Expansion Under Temporal Distribution Shifts
Title (reference translation): Mistakes of the Past: Rethinking Training Data Expansion Under Temporal Distribution Shifts
Authors: Chengyuan Yao, Yunxuan Tang, Christopher Brooks, Rene F. Kizilcec, Renzhe Yu
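Abstract summary: This study examines how expanding the historical training window affects the performance and algorithmic fairness of predictive models. In terms of fairness, models produce more biased predictions when the magnitude of concept shift differs across sociodemographic groups. Concept shift is found to be a key contributor to performance degradation when the training window is expanded.
Reference score (independently computed attention score): 1.2797107590517534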
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Predictive models are typically trained on historical data to predict future outcomes. While it is commonly assumed that training on more historical data would improve model performance and robustness, data distribution shifts over time may undermine these benefits. This study examines how expanding historical data training windows under covariate shifts (changes in feature distributions) and concept shifts (changes in feature-outcome relationships) affects the performance and algorithmic fairness of predictive models. First, we perform a simulation study to explore scenarios with varying degrees of covariate and concept shifts in training data. Absent distribution shifts, we observe performance gains from longer training windows though they reach a plateau quickly; in the presence of concept shift, performance may actually decline. Covariate shifts alone do not significantly affect model performance, but may complicate the impact of concept shifts. In terms of fairness, models produce more biased predictions when the magnitude of concept shifts differs across sociodemographic groups; for intersectional groups, these effects are more complex and not simply additive. Second, we conduct an empirical case study of student retention prediction, a common machine learning application in education, using 12 years of student records from 23 minority-serving community colleges in the United States. We find concept shifts to be a key contributor to performance degradation when expanding the training window. Moreover, model fairness is compromised when marginalized populations have distinct data distribution shift patterns from their peers. Overall, our findings caution against conventional wisdom that "more data is better" and underscore the importance of using historical data judiciously, especially when it may be subject to data distribution shifts, to improve model performance and fairness.
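Abstract (reference translation): Predictive models are usually trained on historical data to predict future outcomes. Training on more historical data is commonly assumed to improve model performance and robustness, but changes in the data distribution over time can undermine these benefits. This study examines how expanding the historical training window under covariate shift (changes in feature distributions) and concept shift (changes in feature-outcome relationships) affects the performance and algorithmic fairness of predictive models. First, a simulation study explores scenarios with varying degrees of covariate and concept shift in the training data. Without distribution shift, longer training windows improve performance, though the gains plateau quickly; with concept shift, performance may actually decline. Covariate shift alone does not significantly affect model performance, but it can complicate the impact of concept shift. In terms of fairness, models produce more biased predictions when the magnitude of concept shift differs across sociodemographic groups. Second, an empirical case study of student retention prediction, a common machine learning application in education, uses 12 years of student records from 23 minority-serving community colleges in the United States. Concept shift is found to be a key contributor to performance degradation when the training window is expanded. Moreover, model fairness is compromised when marginalized populations show data distribution shift patterns distinct from those of their peers. Overall, the findings caution against the conventional wisdom that "more data is better" and underscore the importance of using historical data judiciously, especially when it may be subject to distribution shift, in order to improve model performance and fairness.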
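
Illustrative sketch (not the authors' code or experimental setup): the toy simulation below may help make the abstract's distinction concrete. It assumes a one-feature linear model whose slope drifts over time (standing in for concept shift) and whose feature mean drifts over time (standing in for covariate shift), then compares a short and a long training window on the next period. All function names, parameters, and drift values are hypothetical choices made for this illustration.

```python
import random

def simulate_period(t, n, concept_drift, covariate_drift, rng):
    """Draw n (x, y) pairs for period t from a drifting linear model (assumed toy setup)."""
    slope = 1.0 + concept_drift * t   # concept shift: the x -> y relationship changes
    mean_x = covariate_drift * t      # covariate shift: the distribution of x changes
    data = []
    for _ in range(n):
        x = rng.gauss(mean_x, 1.0)
        y = slope * x + rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data

def fit_ols(data):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(model, data):
    """Mean squared prediction error of (intercept, slope) on data."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in data) / len(data)

def window_mse(window, concept_drift=0.0, covariate_drift=0.0,
               n_periods=10, n_per_period=500, seed=0):
    """Train on the last `window` periods and test on the period that follows."""
    rng = random.Random(seed)
    periods = [simulate_period(t, n_per_period, concept_drift, covariate_drift, rng)
               for t in range(n_periods + 1)]
    train = [pair for period in periods[n_periods - window:n_periods] for pair in period]
    return mse(fit_ols(train), periods[n_periods])

if __name__ == "__main__":
    for label, kwargs in [("no shift", {}),
                          ("covariate shift only", {"covariate_drift": 0.5}),
                          ("concept shift", {"concept_drift": 0.2})]:
        short, long_ = window_mse(1, **kwargs), window_mse(10, **kwargs)
        print(f"{label}: window=1 MSE={short:.3f}, window=10 MSE={long_:.3f}")
```

In this toy setting the long window pools periods with different slopes, so its fitted slope lags behind the test period under concept drift. Covariate drift alone leaves the error roughly unchanged here only because the linear model is correctly specified, which is an assumption of the sketch rather than a claim from the paper.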

Related papers