When the Past Misleads: Rethinking Training Data Expansion Under Temporal Distribution Shifts
- URL: http://arxiv.org/abs/2509.01060v2
- Date: Thu, 04 Sep 2025 17:23:54 GMT
- Title: When the Past Misleads: Rethinking Training Data Expansion Under Temporal Distribution Shifts
- Authors: Chengyuan Yao, Yunxuan Tang, Christopher Brooks, Rene F. Kizilcec, Renzhe Yu
- Abstract summary: This study examines how expanding historical data training windows affects the performance and algorithmic fairness of predictive models. In terms of fairness, models produce more biased predictions when the magnitude of concept shifts differs across sociodemographic groups. We find concept shifts to be a key contributor to performance degradation when expanding the training window.
- Score: 1.2797107590517534
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Predictive models are typically trained on historical data to predict future outcomes. While it is commonly assumed that training on more historical data would improve model performance and robustness, data distribution shifts over time may undermine these benefits. This study examines how expanding historical data training windows under covariate shifts (changes in feature distributions) and concept shifts (changes in feature-outcome relationships) affects the performance and algorithmic fairness of predictive models. First, we perform a simulation study to explore scenarios with varying degrees of covariate and concept shifts in training data. Absent distribution shifts, we observe performance gains from longer training windows though they reach a plateau quickly; in the presence of concept shift, performance may actually decline. Covariate shifts alone do not significantly affect model performance, but may complicate the impact of concept shifts. In terms of fairness, models produce more biased predictions when the magnitude of concept shifts differs across sociodemographic groups; for intersectional groups, these effects are more complex and not simply additive. Second, we conduct an empirical case study of student retention prediction, a common machine learning application in education, using 12 years of student records from 23 minority-serving community colleges in the United States. We find concept shifts to be a key contributor to performance degradation when expanding the training window. Moreover, model fairness is compromised when marginalized populations have distinct data distribution shift patterns from their peers. Overall, our findings caution against conventional wisdom that "more data is better" and underscore the importance of using historical data judiciously, especially when it may be subject to data distribution shifts, to improve model performance and fairness.
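To make the two shift types concrete, here is a minimal, self-contained sketch (not the authors' simulation code; the distributions, coefficients, and sample sizes are illustrative assumptions) that generates several "years" of synthetic data whose feature-outcome relationship drifts over time, then compares models trained on progressively longer historical windows against a held-out "future" year:

```python
# Illustrative toy simulation of concept shift vs. training-window expansion.
# All settings below are assumptions for demonstration, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_year(n, mean, coef):
    """One "year" of records: `mean` controls the feature distribution P(X)
    (covariate shift), `coef` controls the feature-outcome relationship P(Y|X)
    (concept shift)."""
    X = rng.normal(loc=mean, scale=1.0, size=(n, 2))
    p = 1.0 / (1.0 + np.exp(-(X @ coef)))
    y = rng.binomial(1, p)
    return X, y

# The oldest year uses a different feature-outcome relationship (concept shift);
# P(X) is held fixed here, so this example has no covariate shift.
years = [
    make_year(1000, mean=0.0, coef=np.array([2.0, -1.0])),  # oldest
    make_year(1000, mean=0.0, coef=np.array([1.0, 0.5])),
    make_year(1000, mean=0.0, coef=np.array([1.0, 1.0])),   # most recent
]
X_test, y_test = make_year(1000, mean=0.0, coef=np.array([1.0, 1.0]))  # "future" year

# Expand the training window one year at a time, going further into the past.
for window in (1, 2, 3):
    X_tr = np.vstack([X for X, _ in years[-window:]])
    y_tr = np.concatenate([y for _, y in years[-window:]])
    model = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"training window = {window} year(s): test AUC = {auc:.3f}")
```

Under this kind of concept shift, the longest window can score no better, or worse, than a shorter one, mirroring the paper's caution that more historical data is not automatically better; making all `coef` values equal removes the concept shift, while varying `mean` across years would instead induce a covariate shift.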
Related papers
- Distribution Shift Is Key to Learning Invariant Prediction [4.138246425588323]
A large degree of distribution shift can lead to better performance even under Empirical Risk Minimization. We prove that under certain data conditions, ERM solutions can achieve performance comparable to that of invariant prediction models.
arXiv Detail & Related papers (2026-01-18T07:49:57Z) - Small-to-Large Generalization: Data Influences Models Consistently Across Scale [76.87199303408161]
We find that small- and large-scale language model predictions generally correlate highly across choices of training data. We also characterize how proxy scale affects effectiveness in two downstream proxy-model applications: data attribution and dataset selection.
arXiv Detail & Related papers (2025-05-22T05:50:19Z) - Generalization vs. Specialization under Concept Shift [12.196508752999797]
We analyze ridge regression under concept shift and derive an exact expression for the prediction risk in the thermodynamic limit. Our experiments on MNIST and FashionMNIST suggest that this intriguing behavior is also present in classification problems.
arXiv Detail & Related papers (2024-09-23T22:30:28Z) - Ask Your Distribution Shift if Pre-Training is Right for You [67.90850628695563]
In practice, fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others. We focus on two possible failure modes of models under distribution shift: poor extrapolation and biases in the training data. Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases.
arXiv Detail & Related papers (2024-02-29T23:46:28Z) - Learning for Counterfactual Fairness from Observational Data [62.43249746968616]
Fairness-aware machine learning aims to eliminate the biases of learning models against subgroups described by protected (sensitive) attributes such as race, gender, and age.
A prerequisite for existing methods to achieve counterfactual fairness is the prior human knowledge of the causal model for the data.
In this work, we address the problem of counterfactually fair prediction from observational data without given causal models by proposing a novel framework CLAIRE.
arXiv Detail & Related papers (2023-07-17T04:08:29Z) - Non-Invasive Fairness in Learning through the Lens of Data Drift [88.37640805363317]
We show how to improve the fairness of Machine Learning models without altering the data or the learning algorithm.
We use a simple but key insight: the divergence of trends between different populations, and consequently between a learned model and minority populations, is analogous to data drift.
We explore two strategies (model-splitting and reweighing) to resolve this drift, aiming to improve the overall conformance of models to the underlying data.
arXiv Detail & Related papers (2023-03-30T17:30:42Z) - Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z) - Context matters for fairness -- a case study on the effect of spatial distribution shifts [10.351739012146378]
We present a case study on the newly released American Census datasets.
We show how markedly spatial distribution shifts can affect the predictive and fairness-related performance of a model.
Our study suggests that robustness to distribution shifts is necessary before deploying a model to another context.
arXiv Detail & Related papers (2022-06-23T01:09:46Z) - Fairness Transferability Subject to Bounded Distribution Shift [5.62716254065607]
Given an algorithmic predictor that is "fair" on some source distribution, will it still be fair on an unknown target distribution that differs from the source within some bound?
We study the transferability of statistical group fairness for machine learning predictors subject to bounded distribution shifts.
arXiv Detail & Related papers (2022-05-31T22:16:44Z) - Bias-inducing geometries: an exactly solvable data model with fairness implications [12.532003449620607]
We introduce an exactly solvable high-dimensional model of data imbalance. We analytically unpack the typical properties of learning models trained in this synthetic framework. We obtain exact predictions for the observables that are commonly employed for fairness assessment.
arXiv Detail & Related papers (2022-05-31T16:27:57Z) - Managing dataset shift by adversarial validation for credit scoring [5.560471251954645]
A mismatch between the distribution of the training data and the distribution of the data that actually needs to be predicted is likely to cause poor model performance.
We propose a method based on adversarial validation to alleviate the dataset shift problem in credit scoring scenarios (see the sketch after this list).
arXiv Detail & Related papers (2021-12-19T07:07:15Z) - Predicting with Confidence on Unseen Distributions [90.68414180153897]
We connect domain adaptation and predictive uncertainty literature to predict model accuracy on challenging unseen distributions.
We find that the difference of confidences (DoC) of a classifier's predictions successfully estimates the classifier's performance change over a variety of shifts.
We specifically investigate the distinction between synthetic and natural distribution shifts and observe that, despite its simplicity, DoC consistently outperforms other quantifications of distributional difference.
arXiv Detail & Related papers (2021-07-07T15:50:18Z)
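As a concrete illustration of the difference-of-confidences (DoC) quantity described in the entry directly above, the following sketch (function and variable names are assumptions, not the paper's implementation) computes the gap between a classifier's average maximum-softmax confidence on source data and on a shifted target set, which serves as a simple estimate of the accuracy drop:

```python
# Generic DoC sketch under assumed names; see the paper above for refinements.
import numpy as np

def difference_of_confidences(probs_source: np.ndarray,
                              probs_target: np.ndarray) -> float:
    """DoC: average max-softmax confidence on source data minus on target data.

    probs_*: arrays of shape (n_samples, n_classes) with predicted class
    probabilities from the same classifier on the two datasets.
    """
    conf_source = probs_source.max(axis=1).mean()
    conf_target = probs_target.max(axis=1).mean()
    return float(conf_source - conf_target)

# Rough usage: if source accuracy is known, a simple estimate of target
# accuracy is acc_target ~ acc_source - DoC.
```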
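The "Managing dataset shift by adversarial validation" entry above names a technique that is simple to sketch: label training rows and incoming rows as two classes, fit a classifier to distinguish them, and read the cross-validated AUC as a shift signal. The code below is a generic illustration under assumed names and model choices, not that paper's implementation:

```python
# Generic adversarial-validation sketch; model choice and names are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train: np.ndarray, X_new: np.ndarray) -> float:
    """Train a classifier to distinguish training rows (label 0) from new rows
    (label 1); a cross-validated AUC well above 0.5 signals dataset shift."""
    X = np.vstack([X_train, X_new])
    domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_new))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return float(cross_val_score(clf, X, domain, cv=5, scoring="roc_auc").mean())

# AUC ~ 0.5: the two samples look alike; AUC near 1.0: strong shift, suggesting
# that older rows may need to be down-weighted, filtered, or revalidated.
```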
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.