Characterising harmful data sources when constructing multi-fidelity
surrogate models
- URL: http://arxiv.org/abs/2403.08118v1
- Date: Tue, 12 Mar 2024 22:57:53 GMT
- Title: Characterising harmful data sources when constructing multi-fidelity
surrogate models
- Authors: Nicolau Andrés-Thió, Mario Andrés Muñoz, Kate Smith-Miles
- Abstract summary: We present a characterisation of harmful low-fidelity sources using only the limited data available to train a surrogate model.
Analysing one of these benchmark suites with the technique known as Instance Space Analysis, we provide an intuitive visualisation of when a low-fidelity source should be used.
- Score: 2.3020018305241337
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Surrogate modelling techniques have seen growing attention in recent years
when applied to both modelling and optimisation of industrial design problems.
These techniques are highly relevant when assessing the performance of a
particular design carries a high cost, as the overall cost can be mitigated by
constructing a model that is queried in lieu of the available high-cost
source. The construction of these models can sometimes employ other sources of
information which are cheaper but less accurate. The existence of these
sources, however, poses the question of which of them should be used when
constructing a model. Recent studies have attempted to characterise harmful
data sources to guide practitioners in choosing when to ignore a certain
source. These studies have done so in a synthetic setting, characterising
sources using a large amount of data that is not available in practice. Some of
these studies have also been shown to potentially suffer from bias in the
benchmarks used in the analysis. In this study, we present a characterisation
of harmful low-fidelity sources using only the limited data available to train
a surrogate model. We employ recently developed benchmark filtering techniques
to conduct a bias-free assessment, providing objectively varied benchmark
suites of different sizes for future research. Analysing one of these benchmark
suites with the technique known as Instance Space Analysis, we provide an
intuitive visualisation of when a low-fidelity source should be used and use
this analysis to provide guidelines that can be used in an applied industrial
setting.
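The core idea of multi-fidelity surrogate modelling described above, correcting a densely sampled cheap source with a discrepancy model fitted on sparse expensive samples, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the sources `f_high` and `f_low`, and the use of polynomial models in place of the Gaussian-process surrogates typical in this literature, are choices made only for the sketch.

```python
import numpy as np

# Hypothetical sources: the expensive one is sampled sparsely,
# the cheap one densely and with a smooth systematic bias.
def f_high(x):
    return np.sin(8 * x) + x

def f_low(x):
    # cheap source = expensive source plus a smooth bias term
    return f_high(x) - (0.5 * x**2 - 0.3 * x + 0.2)

rng = np.random.default_rng(0)
x_hi = np.sort(rng.uniform(0, 1, 6))    # few high-fidelity samples
x_lo = np.sort(rng.uniform(0, 1, 40))   # many low-fidelity samples

# 1) Fit a model of the low-fidelity source on the dense cheap data.
lo_model = np.polynomial.Polynomial.fit(x_lo, f_low(x_lo), deg=7)

# 2) Model the discrepancy between fidelities at the expensive sites.
disc_model = np.polynomial.Polynomial.fit(
    x_hi, f_high(x_hi) - lo_model(x_hi), deg=2)

# 3) The multi-fidelity surrogate is the corrected low-fidelity model.
def surrogate(x):
    return lo_model(x) + disc_model(x)

x_test = np.linspace(0, 1, 101)
err_mf = np.max(np.abs(surrogate(x_test) - f_high(x_test)))
err_lo = np.max(np.abs(lo_model(x_test) - f_high(x_test)))
```

Here the corrected surrogate beats the raw low-fidelity model because the bias is smooth; the paper's point is precisely that a low-fidelity source whose discrepancy is not well-behaved can be harmful and is better ignored.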
Related papers
- Improving Heterogeneous Model Reuse by Density Estimation [105.97036205113258]
This paper studies multiparty learning, aiming to learn a model using the private data of different participants.
Model reuse is a promising solution for multiparty learning, assuming that a local model has been trained for each party.
arXiv Detail & Related papers (2023-05-23T09:46:54Z)
- New methods for new data? An overview and illustration of quantitative inductive methods for HRM research [0.0]
"Data is the new oil", in short, data would be the essential source of the ongoing fourth industrial revolution.
Unlike oil, there are no major issues here concerning the production of data.
The methodological challenges of data valuation lie, both for practitioners and for academic researchers.
arXiv Detail & Related papers (2023-05-15T09:51:30Z)
- Loss Adapted Plasticity in Deep Neural Networks to Learn from Data with Unreliable Sources [69.6462706723023]
We show that applying this technique can significantly improve model performance when trained on a mixture of reliable and unreliable data sources.
All code to reproduce this work's experiments and implement the algorithm in the reader's own models is made available.
arXiv Detail & Related papers (2022-12-06T11:38:22Z)
- Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate there is no single model that works best for all the cases.
By choosing an appropriate bias model, we can obtain a better robustness result than baselines with a more sophisticated model design.
arXiv Detail & Related papers (2022-10-28T17:52:10Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
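Random oversampling, the simplest of the strategies this paper compares, can be sketched in a few lines. The toy dataset below is an assumption made only for the illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced dataset: 90 majority (0) vs 10 minority (1) examples.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

# Random oversampling: resample minority indices with replacement
# until both classes have the same number of examples.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=90 - 10, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```

Undersampling works symmetrically, discarding majority examples instead; which strategy is preferable is exactly the dataset-dependent question the paper investigates.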
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- CrossAug: A Contrastive Data Augmentation Method for Debiasing Fact Verification Models [14.75693099720436]
We propose CrossAug, a contrastive data augmentation method for debiasing fact verification models.
We employ a two-stage augmentation pipeline to generate new claims and evidences from existing samples.
The generated samples are then paired cross-wise with the original pair, forming contrastive samples that facilitate the model to rely less on spurious patterns.
arXiv Detail & Related papers (2021-09-30T13:19:19Z)
- Unsupervised Multi-source Domain Adaptation Without Access to Source Data [58.551861130011886]
Unsupervised Domain Adaptation (UDA) aims to learn a predictor model for an unlabeled domain by transferring knowledge from a separate labeled source domain.
We propose a novel and efficient algorithm which automatically combines the source models with suitable weights in such a way that it performs at least as good as the best source model.
arXiv Detail & Related papers (2021-04-05T10:45:12Z)
- Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
- Adversarial Canonical Correlation Analysis [0.0]
Canonical Correlation Analysis (CCA) is a technique used to extract common information from multiple data sources or views.
Recent work has given CCA probabilistic footing in a deep learning context.
Separately, adversarial techniques have arisen as a powerful alternative to variational Bayesian methods in autoencoders.
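As a reminder of what the classical linear CCA underlying this paper computes, here is a minimal numpy sketch: whiten each view, then take the SVD of the cross-covariance. The two-view toy data and the implementation details are illustrative assumptions, not the paper's deep or adversarial variant:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two "views" sharing a common latent signal plus independent noise;
# the shared component is what CCA should recover.
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([rng.normal(size=n), -z + 0.1 * rng.normal(size=n)])

def cca_first_pair(X, Y, eps=1e-9):
    """Classical linear CCA: first pair of canonical directions."""
    n = len(X)
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n
    Syy = Yc.T @ Yc / n
    Sxy = Xc.T @ Yc / n
    # Whiten each view via a Cholesky factor, then SVD the whitened
    # cross-covariance; the singular values are canonical correlations.
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx + eps * np.eye(Sxx.shape[0])))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy + eps * np.eye(Syy.shape[0])))
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy.T)
    return Wx.T @ U[:, 0], Wy.T @ Vt[0], s[0]

a, b, corr = cca_first_pair(X, Y)  # corr: first canonical correlation
```

On this toy data the first canonical correlation is close to 1, since the first column of `X` and the second column of `Y` share the latent `z` up to sign and a small amount of noise.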
arXiv Detail & Related papers (2020-05-20T20:46:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.