Data Models for Dataset Drift Controls in Machine Learning With Optical Images
- URL: http://arxiv.org/abs/2211.02578v3
- Date: Sun, 7 May 2023 05:58:05 GMT
- Title: Data Models for Dataset Drift Controls in Machine Learning With Optical Images
- Authors: Luis Oala, Marco Aversa, Gabriel Nobis, Kurt Willis, Yoan
Neuenschwander, Michèle Buck, Christian Matek, Jerome Extermann, Enrico
Pomarico, Wojciech Samek, Roderick Murray-Smith, Christoph Clausen, Bruno
Sanguinetti
- Abstract summary: A primary failure mode is a drop in performance due to differences between the training and deployment data.
Existing approaches do not account for explicit models of the primary object of interest: the data.
We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift.
- Score: 8.818468649062932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Camera images are ubiquitous in machine learning research. They also play a
central role in the delivery of important services spanning medicine and
environmental surveying. However, the application of machine learning models in
these domains has been limited because of robustness concerns. A primary
failure mode is a drop in performance due to differences between the training and
deployment data. While there are methods to prospectively validate the
robustness of machine learning models to such dataset drifts, existing
approaches do not account for explicit models of the primary object of
interest: the data. This limits our ability to study and understand the
relationship between data generation and downstream machine learning model
performance in a physically accurate manner. In this study, we demonstrate how
to overcome this limitation by pairing traditional machine learning with
physical optics to obtain explicit and differentiable data models. We
demonstrate how such data models can be constructed for image data and used to
control downstream machine learning model performance related to dataset drift.
The findings are distilled into three applications. First, drift synthesis
enables the controlled generation of physically faithful drift test cases to
power model selection and targeted generalization. Second, the gradient
connection between machine learning task model and data model allows advanced,
precise tolerancing of task model sensitivity to changes in the data
generation. These drift forensics can be used to precisely specify the
acceptable data environments in which a task model may be run. Third, drift
optimization opens up the possibility to create drifts that can help the task
model learn better and faster, effectively optimizing the data-generating process
itself. A guide to access the open code and datasets is available at
https://github.com/aiaudit-org/raw2logit.
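The three applications all rest on the data model being explicit and differentiable, so that gradients can flow from the task loss back into the data-generating parameters. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' raw2logit implementation: ToyDataModel with gain, black-level, and gamma knobs stands in for the paper's physically faithful optics/ISP model, and ToyTaskModel is a placeholder classifier.

```python
# Hypothetical sketch (not the raw2logit code): a differentiable "data model"
# that turns raw sensor intensities into processed images, composed with a task
# model so gradients flow back into the data-generating parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDataModel(nn.Module):
    """Illustrative processing stages (gain, black level, gamma); the paper's
    data model is a physical optics/ISP pipeline, not these three knobs."""
    def __init__(self):
        super().__init__()
        self.gain = nn.Parameter(torch.tensor(1.0))
        self.black_level = nn.Parameter(torch.tensor(0.05))
        self.gamma = nn.Parameter(torch.tensor(2.2))

    def forward(self, raw):
        x = torch.clamp(self.gain * (raw - self.black_level), 1e-6, 1.0)
        return x ** (1.0 / self.gamma)

class ToyTaskModel(nn.Module):
    """Placeholder downstream classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, num_classes),
        )

    def forward(self, x):
        return self.net(x)

data_model, task_model = ToyDataModel(), ToyTaskModel()
raw = torch.rand(4, 1, 32, 32)            # stand-in for raw sensor frames
labels = torch.randint(0, 10, (4,))

# Drift synthesis: perturb data-model parameters to create controlled test cases.
with torch.no_grad():
    data_model.gain.mul_(1.3)             # e.g. simulate a brighter acquisition setup

# Drift forensics: gradients of the task loss w.r.t. data-model parameters
# quantify how sensitive the task model is to the data-generating process.
loss = F.cross_entropy(task_model(data_model(raw)), labels)
loss.backward()
print({name: p.grad.item() for name, p in data_model.named_parameters()})

# Drift optimization: the same parameters could instead be trained jointly, e.g.
# opt = torch.optim.Adam(list(data_model.parameters()) + list(task_model.parameters()), lr=1e-3)
```

In this toy setting, perturbing the data-model parameters yields controlled drift test cases, the printed gradients expose the task model's sensitivity to the data-generating process, and the commented optimizer line indicates how those parameters could be optimized jointly with the task model.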
Related papers
- A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data [9.57464542357693]
This paper demonstrates that model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering.
We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset.
After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection reduces.
arXiv Detail & Related papers (2024-07-02T09:54:39Z)
- SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control [59.20038082523832]
We present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications.
We develop a novel model equipped with a subject control mechanism, which allows the generative model to leverage diverse external data sources for producing varied and useful data.
arXiv Detail & Related papers (2024-03-28T14:07:13Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- Quilt: Robust Data Segment Selection against Concept Drifts [30.62320149405819]
Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams.
Concept drifts may occur in data streams when the joint distribution of the data X and label y, P(X, y), changes over time, possibly degrading model accuracy.
Existing concept drift adaptation approaches mostly focus on updating the model to the new data and tend to discard the drifted historical data.
We propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy.
arXiv Detail & Related papers (2023-12-15T11:10:34Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- On Inductive Biases for Machine Learning in Data Constrained Settings [0.0]
This thesis explores a different answer to the problem of learning expressive models in data constrained settings.
Instead of relying on big datasets to learn neural networks, we will replace some modules with known functions reflecting the structure of the data.
Our approach falls under the umbrella of "inductive biases", which can be defined as hypotheses on the data at hand that restrict the space of models to explore.
arXiv Detail & Related papers (2023-02-21T14:22:01Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
Often such fine-tuned models are available while their training data is not, which creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing practice, however, does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports these requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - How Well Do Sparse Imagenet Models Transfer? [75.98123173154605]
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" datasets.
In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset.
We show that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities.
arXiv Detail & Related papers (2021-11-26T11:58:51Z) - A Note on Data Biases in Generative Models [16.86600007830682]
We investigate the impact of dataset quality on the performance of generative models.
We show how societal biases of datasets are replicated by generative models.
We present creative applications through unpaired transfer between diverse datasets such as photographs, oil portraits, and anime drawings.
arXiv Detail & Related papers (2020-12-04T10:46:37Z)
- It's the Best Only When It Fits You Most: Finding Related Models for Serving Based on Dynamic Locality Sensitive Hashing [1.581913948762905]
Preparation of training data is often a bottleneck in the lifecycle of deploying a deep learning model for production or research.
This paper proposes an end-to-end process of searching related models for serving, based on the similarity between the target dataset and the training datasets of the available models; a toy similarity-hashing sketch follows this list.
arXiv Detail & Related papers (2020-10-13T22:52:13Z)
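As a rough illustration of the dataset-similarity idea in the last entry, and not the cited paper's dynamic locality sensitive hashing algorithm, the sketch below hashes a dataset's mean feature vector against random hyperplanes so that models whose training data resembles the target dataset land in the same bucket; the dataset sizes, dimensions, and offsets are invented for the example.

```python
# Toy sketch: random-hyperplane LSH over dataset signatures to shortlist
# candidate models for serving. All names and numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def dataset_signature(features: np.ndarray) -> np.ndarray:
    """Summarize a dataset by the mean of its (e.g. embedding) feature vectors."""
    return features.mean(axis=0)

def lsh_bucket(signature: np.ndarray, planes: np.ndarray) -> tuple:
    """Sign pattern against random hyperplanes approximates cosine similarity."""
    return tuple((planes @ signature > 0).astype(int))

dim, n_planes = 64, 16
planes = rng.standard_normal((n_planes, dim))

# Index the training-dataset signatures of already-available models.
index = {}
for model_id, offset in [("model_a", 2.0), ("model_b", -2.0)]:
    train_feats = rng.standard_normal((500, dim)) + offset
    index.setdefault(lsh_bucket(dataset_signature(train_feats), planes), []).append(model_id)

# Query: the target dataset's signature retrieves models trained on similar data.
target_feats = rng.standard_normal((200, dim)) + 2.0
candidates = index.get(lsh_bucket(dataset_signature(target_feats), planes), [])
print(candidates)  # likely ["model_a"], since the target shares its feature statistics
```

A dynamic variant, as the cited paper's title suggests, would additionally update the hash tables as new models and datasets become available rather than building the index once.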