Quilt: Robust Data Segment Selection against Concept Drifts
- URL: http://arxiv.org/abs/2312.09691v1
- Date: Fri, 15 Dec 2023 11:10:34 GMT
- Title: Quilt: Robust Data Segment Selection against Concept Drifts
- Authors: Minsu Kim, Seong-Hyeon Hwang, Steven Euijong Whang
- Abstract summary: Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams.
Concept drifts may occur in data streams, where the joint distribution of the data X and label y, P(X, y), changes over time, possibly degrading model accuracy.
Existing concept drift adaptation approaches mostly focus on updating the model to the new data and tend to discard the drifted historical data.
We propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy.
- Score: 30.62320149405819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continuous machine learning pipelines are common in industrial settings where
models are periodically trained on data streams. Unfortunately, concept drifts
may occur in data streams where the joint distribution of the data X and label
y, P(X, y), changes over time, possibly degrading model accuracy. Existing
concept drift adaptation approaches mostly focus on updating the model to the
new data, possibly using ensembles of previous models, and tend to
discard the drifted historical data. However, we contend that explicitly
utilizing the drifted data together with the current data leads to much better model accuracy and
propose Quilt, a data-centric framework for identifying and selecting data
segments that maximize model accuracy. To address the potential downside of
efficiency, Quilt extends existing data subset selection techniques, which can
be used to reduce the training data without compromising model accuracy. These
techniques cannot be used as-is because they assume only virtual drifts, where
the posterior probabilities P(y|X) do not change. In contrast, a
key challenge in our setup is to also discard undesirable data segments with
concept drifts. Quilt thus discards drifted data segments and selects data
segment subsets holistically for accurate and efficient model training. The two
operations use gradient-based scores, which incur little computational overhead.
In our experiments, we show that Quilt outperforms state-of-the-art drift
adaptation and data selection baselines on synthetic and real datasets.
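As a rough illustration of the selection idea, here is a minimal sketch, not Quilt's published algorithm: score each historical segment by how well its average gradient aligns with the gradient on recent validation data, and keep only well-aligned segments. The cosine-similarity rule, the threshold, and all names below are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of gradient-based segment selection:
# segments whose gradients oppose the validation gradient suggest concept drift
# and are discarded; aligned segments are kept for training.
import torch
import torch.nn as nn

def avg_gradient(model, loss_fn, X, y):
    """Return the flattened average gradient of the loss over (X, y)."""
    model.zero_grad()
    loss_fn(model(X), y).backward()
    return torch.cat([p.grad.detach().flatten() for p in model.parameters()])

def select_segments(model, loss_fn, segments, X_val, y_val, threshold=0.0):
    """Keep segments whose gradient aligns with the validation gradient."""
    g_val = avg_gradient(model, loss_fn, X_val, y_val)
    selected = []
    for i, (X_seg, y_seg) in enumerate(segments):
        g_seg = avg_gradient(model, loss_fn, X_seg, y_seg)
        score = torch.dot(g_seg, g_val) / (g_seg.norm() * g_val.norm() + 1e-12)
        if score > threshold:
            selected.append(i)
    return selected

# Toy usage: two segments, the second with flipped labels to mimic a drift.
torch.manual_seed(0)
model = nn.Linear(4, 2)
loss_fn = nn.CrossEntropyLoss()
X = torch.randn(64, 4)
y = (X[:, 0] > 0).long()
segments = [(X, y), (X, 1 - y)]      # second segment has drifted labels
print(select_segments(model, loss_fn, segments, X, y))  # drifted one filtered
```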
Related papers
- A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation [0.562479170374811]
In many real-world applications, continuous machine learning (ML) systems are crucial but prone to data drift.
Traditional drift adaptation methods typically update models using ensemble techniques, often discarding drifted historical data.
We contend that explicitly incorporating drifted data into the model training process significantly enhances model accuracy and robustness.
arXiv Detail & Related papers (2024-11-23T17:35:23Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
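As a hedged sketch of the gradient-similarity idea (not LESS's actual implementation): project per-example gradients to a low dimension with a random matrix, then rank training points by cosine similarity to the average target-task gradient. All sizes and names below are illustrative.

```python
# Influence-style selection via projected gradient similarity, as a sketch.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()

def grad_vector(x, y):
    """Flattened gradient of the loss on a single example."""
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

dim = sum(p.numel() for p in model.parameters())
proj = torch.randn(dim, 16) / 16 ** 0.5     # random low-rank projection

X_train, y_train = torch.randn(100, 8), torch.randint(0, 2, (100,))
X_tgt, y_tgt = torch.randn(10, 8), torch.randint(0, 2, (10,))

# average projected gradient of the target task
g_tgt = torch.stack([grad_vector(x, y) for x, y in zip(X_tgt, y_tgt)]).mean(0) @ proj
# rank training examples by gradient similarity to the target gradient
scores = torch.stack([
    torch.cosine_similarity(grad_vector(x, y) @ proj, g_tgt, dim=0)
    for x, y in zip(X_train, y_train)
])
print(scores.topk(5).indices)               # the top 5% of 100 examples
```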
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
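A minimal sketch of one plausible reading of online data mixing, assuming an EXP3-style bandit over data domains in which per-batch training loss serves as the reward; this is illustrative, not the paper's exact algorithm.

```python
# Each data domain is a bandit arm; its sampling weight grows when batches
# drawn from it yield high reward (here, a stand-in for training loss).
import math, random

class OnlineDataMixer:
    def __init__(self, n_domains, lr=0.1):
        self.w = [0.0] * n_domains      # log-weights per domain
        self.lr = lr

    def probs(self):
        m = max(self.w)
        exps = [math.exp(v - m) for v in self.w]
        s = sum(exps)
        return [e / s for e in exps]

    def sample_domain(self):
        return random.choices(range(len(self.w)), weights=self.probs())[0]

    def update(self, domain, reward):
        # importance-weighted reward, as in EXP3
        p = self.probs()[domain]
        self.w[domain] += self.lr * reward / p

# Toy usage: domain 1 yields higher reward and gains sampling mass.
random.seed(0)
mixer = OnlineDataMixer(n_domains=3)
for _ in range(200):
    d = mixer.sample_domain()
    reward = 1.0 if d == 1 else 0.2     # stand-in for per-batch training loss
    mixer.update(d, reward)
print([round(p, 2) for p in mixer.probs()])
```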
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
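A hedged sketch of distribution matching: optimize a small synthetic set so its feature statistics under freshly sampled random encoders match those of real data. The single-moment (mean-embedding) objective and all names are illustrative simplifications, not the paper's exact method.

```python
# Condense 256 real examples into 8 synthetic ones by matching mean embeddings
# under a randomly re-initialized encoder at every step (an MMD-like objective).
import torch
import torch.nn as nn

torch.manual_seed(0)
X_real = torch.randn(256, 32)                     # stand-in for one class
X_syn = torch.randn(8, 32, requires_grad=True)    # 8 learnable synthetic points
opt = torch.optim.Adam([X_syn], lr=0.05)

for step in range(300):
    encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # fresh random encoder
    loss = (encoder(X_real).mean(0) - encoder(X_syn).mean(0)).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())   # matching error shrinks as the synthetic set adapts
```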
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
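A minimal sketch of gradient-norm data valuation (illustrative, not the paper's exact metrics): examples with small per-example gradient norms contribute little training signal, so selection keeps the highest-valued points.

```python
# Value each example by the norm of its loss gradient, then keep the top half.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)
loss_fn = nn.CrossEntropyLoss()
X, y = torch.randn(200, 16), torch.randint(0, 4, (200,))

def grad_norm(x, t):
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), t.unsqueeze(0)).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()]).norm().item()

values = torch.tensor([grad_norm(x, t) for x, t in zip(X, y)])
keep = values.topk(100).indices          # retain the most informative half
print(values.mean().item(), len(keep))
```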
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
- Data Models for Dataset Drift Controls in Machine Learning With Optical Images [8.818468649062932]
A primary failure mode is a performance drop due to differences between the training and deployment data.
Existing approaches do not account for explicit models of the primary object of interest: the data.
We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift.
arXiv Detail & Related papers (2022-11-04T16:50:10Z)
- Employing chunk size adaptation to overcome concept drift [2.277447144331876]
We propose a new Chunk Adaptive Restoration framework that can be adapted to any block-based data stream classification algorithm.
The proposed algorithm adjusts the data chunk size when concept drift is detected, minimizing the impact of the change on the model's predictive performance.
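A hedged sketch of chunk-size adaptation: shrink the chunk after a detected drift so the model retrains on small, fresh chunks, then grow it back toward its nominal size while the stream is stable. The halving and growth constants below are illustrative, not taken from the paper.

```python
# React to drift with small chunks, relax back to large chunks when stable.
def next_chunk_size(current, drift_detected, min_size=50, max_size=1000,
                    growth=1.5):
    if drift_detected:
        return max(min_size, current // 2)        # react quickly to the change
    return min(max_size, int(current * growth))   # recover once stable

size = 1000
for drift in [False, False, True, False, False, False]:
    size = next_chunk_size(size, drift)
    print(size)
```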
arXiv Detail & Related papers (2021-10-25T12:36:22Z)
- Unsupervised Model Drift Estimation with Batch Normalization Statistics for Dataset Shift Detection and Model Selection [0.0]
We propose a novel method of model drift estimation that exploits the statistics of batch normalization layers on unlabeled test data.
We show the effectiveness of our method not only for dataset shift detection but also for unsupervised model selection when there are multiple candidate models in a model zoo or along training trajectories.
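A minimal sketch of the idea: compare the running mean and variance stored in each batch-normalization layer (the training distribution) against the activation statistics of an unlabeled test batch. The Gaussian KL divergence used as the distance below is an assumption for illustration.

```python
# Drift score = divergence between stored BN stats and test-batch activations.
import torch
import torch.nn as nn

def gaussian_kl(mu0, var0, mu1, var1, eps=1e-5):
    # KL divergence between diagonal Gaussians N(mu0, var0) and N(mu1, var1)
    var0, var1 = var0 + eps, var1 + eps
    return 0.5 * (torch.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1)

@torch.no_grad()
def bn_drift_score(model, X_test):
    """Average KL between stored BN statistics and test-batch statistics."""
    inputs = {}
    hooks = [m.register_forward_hook(
                 lambda mod, inp, out, d=inputs: d.update({mod: inp[0]}))
             for m in model.modules() if isinstance(m, nn.BatchNorm1d)]
    model.eval()
    model(X_test)
    for h in hooks:
        h.remove()
    scores = [gaussian_kl(m.running_mean, m.running_var,
                          x.mean(0), x.var(0)).mean()
              for m, x in inputs.items()]
    return torch.stack(scores).mean().item()

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU())
model.train()
for _ in range(50):                        # populate BN running statistics
    model(torch.randn(512, 8))
print(bn_drift_score(model, torch.randn(256, 8)))        # in-distribution: low
print(bn_drift_score(model, torch.randn(256, 8) + 3.0))  # shifted: much higher
```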
arXiv Detail & Related papers (2021-07-01T03:04:47Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion (CMI), where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
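A hedged sketch of treating diversity as an optimizable objective: an InfoNCE-style instance-discrimination penalty that pushes synthesized samples apart in feature space. The encoder-free setup and temperature are illustrative assumptions, not the paper's configuration.

```python
# Penalize pairwise similarity among synthesized features so they spread out.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z = torch.randn(16, 64, requires_grad=True)   # features of synthesized samples
# pairwise cosine similarities with a temperature of 0.1 (an assumed value)
sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=-1) / 0.1
mask = torch.eye(len(z), dtype=torch.bool)
# exclude self-similarity; high similarity between distinct samples is costly
diversity_loss = torch.logsumexp(sim.masked_fill(mask, float('-inf')), dim=1).mean()
diversity_loss.backward()                     # gradients flow to the samples
print(diversity_loss.item())
```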
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
- Injecting Knowledge in Data-driven Vehicle Trajectory Predictors [82.91398970736391]
Vehicle trajectory prediction tasks have been commonly tackled from two perspectives: knowledge-driven or data-driven.
In this paper, we propose to learn a "Realistic Residual Block" (RRB) which effectively connects these two perspectives.
Our proposed method outputs realistic predictions by confining the residual range and taking into account its uncertainty.
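A hedged sketch of the residual idea: a knowledge-driven predictor produces a physically plausible trajectory, and a small network adds a learned residual whose magnitude is confined (here via tanh scaling). The bound value and architecture are illustrative assumptions.

```python
# Refine a knowledge-driven trajectory with a bounded, learned residual.
import torch
import torch.nn as nn

class ResidualRefiner(nn.Module):
    def __init__(self, feat_dim, horizon, max_residual=0.5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, horizon * 2))
        self.max_residual = max_residual
        self.horizon = horizon

    def forward(self, features, knowledge_pred):
        # residual confined to [-max_residual, max_residual] per coordinate
        r = torch.tanh(self.net(features)).view(-1, self.horizon, 2)
        return knowledge_pred + self.max_residual * r

# Toy usage: constant-velocity rollout refined by the residual network.
torch.manual_seed(0)
feats = torch.randn(4, 32)                       # encoded scene features
cv_pred = torch.cumsum(torch.ones(4, 12, 2), 1)  # constant-velocity rollout
refined = ResidualRefiner(32, 12)(feats, cv_pred)
print(refined.shape)   # torch.Size([4, 12, 2])
```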
arXiv Detail & Related papers (2021-03-08T16:03:09Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)