Characterising harmful data sources when constructing multi-fidelity
surrogate models
- URL: http://arxiv.org/abs/2403.08118v1
- Date: Tue, 12 Mar 2024 22:57:53 GMT
- Title: Characterising harmful data sources when constructing multi-fidelity
surrogate models
- Authors: Nicolau Andrés-Thió, Mario Andrés Muñoz, Kate Smith-Miles
- Abstract summary: We present a characterisation of harmful low-fidelity sources using only the limited data available to train a surrogate model.
Analysing one of these benchmark suites with the technique known as Instance Space Analysis, we provide an intuitive visualisation of when a low-fidelity source should be used.
- Score: 2.3020018305241337
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Surrogate modelling techniques have seen growing attention in recent years
when applied to both modelling and optimisation of industrial design problems.
These techniques are highly relevant when assessing the performance of a
particular design carries a high cost, as the overall cost can be mitigated by
constructing a model that is queried in lieu of the available high-cost
source. The construction of these models can sometimes employ other sources of
information which are cheaper but less accurate. The existence of these
sources, however, poses the question of which of them should be used when
constructing a model. Recent studies have attempted to characterise harmful
data sources to guide practitioners in choosing when to ignore a certain
source. These studies have done so in a synthetic setting, characterising
sources using a large amount of data that is not available in practice. Some of
these studies have also been shown to potentially suffer from bias in the
benchmarks used in the analysis. In this study, we present a characterisation
of harmful low-fidelity sources using only the limited data available to train
a surrogate model. We employ recently developed benchmark filtering techniques
to conduct a bias-free assessment, providing objectively varied benchmark
suites of different sizes for future research. Analysing one of these benchmark
suites with the technique known as Instance Space Analysis, we provide an
intuitive visualisation of when a low-fidelity source should be used and use
this analysis to provide guidelines that can be used in an applied industrial
setting.
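The core idea of multi-fidelity surrogate modelling described above, correcting a densely sampled cheap source with a discrepancy model fitted on sparse expensive samples, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the sources `f_high` and `f_low`, and the use of polynomial models in place of the Gaussian-process surrogates typical in this literature, are choices made only for the sketch.

```python
import numpy as np

# Hypothetical sources: the expensive one is sampled sparsely,
# the cheap one densely and with a smooth systematic bias.
def f_high(x):
    return np.sin(8 * x) + x

def f_low(x):
    # cheap source = expensive source plus a smooth bias term
    return f_high(x) - (0.5 * x**2 - 0.3 * x + 0.2)

rng = np.random.default_rng(0)
x_hi = np.sort(rng.uniform(0, 1, 6))    # few high-fidelity samples
x_lo = np.sort(rng.uniform(0, 1, 40))   # many low-fidelity samples

# 1) Fit a model of the low-fidelity source on the dense cheap data.
lo_model = np.polynomial.Polynomial.fit(x_lo, f_low(x_lo), deg=7)

# 2) Model the discrepancy between fidelities at the expensive sites.
disc_model = np.polynomial.Polynomial.fit(
    x_hi, f_high(x_hi) - lo_model(x_hi), deg=2)

# 3) The multi-fidelity surrogate is the corrected low-fidelity model.
def surrogate(x):
    return lo_model(x) + disc_model(x)

x_test = np.linspace(0, 1, 101)
err_mf = np.max(np.abs(surrogate(x_test) - f_high(x_test)))
err_lo = np.max(np.abs(lo_model(x_test) - f_high(x_test)))
```

Here the corrected surrogate beats the raw low-fidelity model because the bias is smooth; the paper's point is precisely that a low-fidelity source whose discrepancy is not well-behaved can be harmful and is better ignored.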
Related papers
- Improving Heterogeneous Model Reuse by Density Estimation [105.97036205113258]
This paper studies multiparty learning, aiming to learn a model using the private data of different participants.
Model reuse is a promising solution for multiparty learning, assuming that a local model has been trained for each party.
arXiv Detail & Related papers (2023-05-23T09:46:54Z)
- New methods for new data? An overview and illustration of quantitative inductive methods for HRM research [0.0]
"Data is the new oil", in short, data would be the essential source of the ongoing fourth industrial revolution.
Unlike oil, there are no major issues here concerning the production of data.
The methodological challenges of data valuation lie, both for practitioners and for academic researchers.
arXiv Detail & Related papers (2023-05-15T09:51:30Z)
- Loss Adapted Plasticity in Deep Neural Networks to Learn from Data with Unreliable Sources [69.6462706723023]
We show that applying this technique can significantly improve model performance when trained on a mixture of reliable and unreliable data sources.
All code to reproduce this work's experiments and implement the algorithm in the reader's own models is made available.
arXiv Detail & Related papers (2022-12-06T11:38:22Z)
- Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate there is no single model that works best for all the cases.
By choosing an appropriate bias model, we can obtain a better robustness result than baselines with a more sophisticated model design.
arXiv Detail & Related papers (2022-10-28T17:52:10Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
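Random oversampling, the simplest of the strategies this paper compares, can be sketched in a few lines. The toy dataset below is an assumption made only for the illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced dataset: 90 majority (0) vs 10 minority (1) examples.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

# Random oversampling: resample minority indices with replacement
# until both classes have the same number of examples.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=90 - 10, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```

Undersampling works symmetrically, discarding majority examples instead; which strategy is preferable is exactly the dataset-dependent question the paper investigates.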
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- CrossAug: A Contrastive Data Augmentation Method for Debiasing Fact Verification Models [14.75693099720436]
We propose CrossAug, a contrastive data augmentation method for debiasing fact verification models.
We employ a two-stage augmentation pipeline to generate new claims and evidences from existing samples.
The generated samples are then paired cross-wise with the original pair, forming contrastive samples that facilitate the model to rely less on spurious patterns.
arXiv Detail & Related papers (2021-09-30T13:19:19Z)
- Unsupervised Multi-source Domain Adaptation Without Access to Source Data [58.551861130011886]
Unsupervised Domain Adaptation (UDA) aims to learn a predictor model for an unlabeled domain by transferring knowledge from a separate labeled source domain.
We propose a novel and efficient algorithm which automatically combines the source models with suitable weights in such a way that it performs at least as good as the best source model.
arXiv Detail & Related papers (2021-04-05T10:45:12Z)
- Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
- Adversarial Canonical Correlation Analysis [0.0]
Canonical Correlation Analysis (CCA) is a technique used to extract common information from multiple data sources or views.
Recent work has given CCA probabilistic footing in a deep learning context.
Separately, adversarial techniques have arisen as a powerful alternative to variational Bayesian methods in autoencoders.
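As a reminder of what the classical linear CCA underlying this paper computes, here is a minimal numpy sketch: whiten each view, then take the SVD of the cross-covariance. The two-view toy data and the implementation details are illustrative assumptions, not the paper's deep or adversarial variant:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two "views" sharing a common latent signal plus independent noise;
# the shared component is what CCA should recover.
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([rng.normal(size=n), -z + 0.1 * rng.normal(size=n)])

def cca_first_pair(X, Y, eps=1e-9):
    """Classical linear CCA: first pair of canonical directions."""
    n = len(X)
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n
    Syy = Yc.T @ Yc / n
    Sxy = Xc.T @ Yc / n
    # Whiten each view via a Cholesky factor, then SVD the whitened
    # cross-covariance; the singular values are canonical correlations.
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx + eps * np.eye(Sxx.shape[0])))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy + eps * np.eye(Syy.shape[0])))
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy.T)
    return Wx.T @ U[:, 0], Wy.T @ Vt[0], s[0]

a, b, corr = cca_first_pair(X, Y)  # corr: first canonical correlation
```

On this toy data the first canonical correlation is close to 1, since the first column of `X` and the second column of `Y` share the latent `z` up to sign and a small amount of noise.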
arXiv Detail & Related papers (2020-05-20T20:46:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.