Related papers: A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation

A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation

URL: http://arxiv.org/abs/2411.15616v1
Date: Sat, 23 Nov 2024 17:35:23 GMT
Title: A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation
Authors: Vennela Yarabolu, Govind Waghmare, Sonia Gupta, Siddhartha Asthana,
Abstract summary: In many real-world applications, continuous machine learning (ML) systems are crucial but prone to data drift. Traditional drift adaptation methods typically update models using ensemble techniques, often discarding drifted historical data. We contend that explicitly incorporating drifted data into the model training process significantly enhances model accuracy and robustness.
Score: 0.562479170374811
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In many real-world applications, continuous machine learning (ML) systems are crucial but prone to data drift, a phenomenon where discrepancies between historical training data and future test data lead to significant performance degradation and operational inefficiencies. Traditional drift adaptation methods typically update models using ensemble techniques, often discarding drifted historical data, and focus primarily on either covariate drift or concept drift. These methods face issues such as high resource demands, inability to manage all types of drifts effectively, and neglecting the valuable context that historical data can provide. We contend that explicitly incorporating drifted data into the model training process significantly enhances model accuracy and robustness. This paper introduces an advanced framework that integrates the strengths of data-centric approaches with adaptive management of both covariate and concept drift in a scalable and efficient manner. Our framework employs sophisticated data segmentation techniques to identify optimal data batches that accurately reflect test data patterns. These data batches are then utilized for training on test data, ensuring that the models remain relevant and accurate over time. By leveraging the advantages of both data segmentation and scalable drift management, our solution ensures robust model accuracy and operational efficiency in large-scale ML deployments. It also minimizes resource consumption and computational overhead by selecting and utilizing relevant data subsets, leading to significant cost savings. Experimental results on classification task on real-world and synthetic datasets show our approach improves model accuracy while reducing operational costs and latency. This practical solution overcomes inefficiencies in current methods, providing a robust, adaptable, and scalable approach.

Related papers

History Is Not Enough: An Adaptive Dataflow System for Financial Time-Series Synthesis [25.486090554711797]
"History Is Not Enough" underscores the need for adaptive data generation that learns to evolve with the market.<n>We present a drift-aware dataflow system that integrates machine learning-based adaptive control into the data curation process.
arXiv Detail & Related papers (2026-01-15T07:38:59Z)
Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning [49.10890099624699]
We introduce a dynamic dataset pruning framework that adaptively selects training samples based on task-driven difficulty and cross-modality semantic consistency.<n>Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.
arXiv Detail & Related papers (2025-07-17T03:08:26Z)
Losing is for Cherishing: Data Valuation Based on Machine Unlearning and Shapley Value [18.858879113762917]
We propose Unlearning Shapley, a novel framework that leverages machine unlearning to estimate data values efficiently.<n>Our method computes Shapley values via Monte Carlo sampling, avoiding retraining and eliminating dependence on full data.<n>This work bridges the gap between data valuation theory and practical deployment, offering a scalable, privacy-compliant solution for modern AI ecosystems.
arXiv Detail & Related papers (2025-05-22T02:46:03Z)
Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation [4.030723722142048]
This paper tackles the challenges associated with the unstructured and heterogeneous nature of webcrawl datasets. We introduce an advanced, learning-driven approach, Ensemble Curation Of DAta ThroUgh Multimodal Operators (EcoDatum) EcoDatum strategically integrates various unimodal and multimodal data curation operators within a weak supervision ensemble framework. It ranked 1st on the DataComp leaderboard, with an average performance score of 0.182 across 38 diverse evaluation datasets.
arXiv Detail & Related papers (2025-02-12T08:40:57Z)
Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training. We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
SUDS: A Strategy for Unsupervised Drift Sampling [0.5437605013181142]
Supervised machine learning encounters concept drift, where the data distribution changes over time, degrading performance. We present the Strategy for Drift Sampling (SUDS), a novel method that selects homogeneous samples for retraining using existing drift detection algorithms. Our results demonstrate the efficacy of SUDS in optimizing labeled data use in dynamic environments.
arXiv Detail & Related papers (2024-11-05T10:55:29Z)
A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset. We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective. This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
arXiv Detail & Related papers (2024-01-29T03:42:37Z)
Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data. One key challenge in federated learning is to handle non-identically distributed data across the clients. We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling the data issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z)
Quilt: Robust Data Segment Selection against Concept Drifts [30.62320149405819]
Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams. concept drifts may occur in data streams where the joint distribution of the data X and label y, P(X, y), changes over time and possibly degrade model accuracy. Existing concept drift adaptation approaches mostly focus on updating the model to the new data and tend to discard the drifted historical data. We propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy.
arXiv Detail & Related papers (2023-12-15T11:10:34Z)
A Conditioned Unsupervised Regression Framework Attuned to the Dynamic Nature of Data Streams [0.0]
This paper presents an optimal strategy for streaming contexts with limited labeled data, introducing an adaptive technique for unsupervised regression. The proposed method leverages a sparse set of initial labels and introduces an innovative drift detection mechanism. To enhance adaptability, we integrate the ADWIN (ADaptive WINdowing) algorithm with error generalization based on Root Mean Square Error (RMSE)
arXiv Detail & Related papers (2023-12-12T19:23:54Z)
Towards Accelerated Model Training via Bayesian Data Selection [45.62338106716745]
We propose a more reasonable data selection principle by examining the data's impact on the model's generalization loss. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models.
arXiv Detail & Related papers (2023-08-21T07:58:15Z)
Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching. Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is the fundamental task in natural language understanding (NLU) Most existing continual learning approaches suffer from low accuracy and performance fluctuation. We propose a hyper parameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z)
Training Deep Normalizing Flow Models in Highly Incomplete Data Scenarios with Prior Regularization [13.985534521589257]
We propose a novel framework to facilitate the learning of data distributions in high paucity scenarios. The proposed framework naturally stems from posing the process of learning from incomplete data as a joint optimization task.
arXiv Detail & Related papers (2021-04-03T20:57:57Z)
How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance. We formulate a quality measure for the data set, which we refer to as $rho$-gap. We show how the $rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.