On the use of adversarial validation for quantifying dissimilarity in geospatial machine learning prediction
- URL: http://arxiv.org/abs/2404.12575v1
- Date: Fri, 19 Apr 2024 01:48:21 GMT
- Title: On the use of adversarial validation for quantifying dissimilarity in geospatial machine learning prediction
- Authors: Yanwen Wang, Mahdi Khodadadzadeh, Raul Zurita-Milla
- Abstract summary: Cross-validation results are affected by the dissimilarity between the sample data and the prediction locations.
We propose a method to quantify such dissimilarity on a scale from 0 to 100%, from the perspective of the data feature space.
Results show that the proposed method can successfully quantify dissimilarity across the entire range of values.
- Score: 1.1470070927586018
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent geospatial machine learning studies have shown that the results of model evaluation via cross-validation (CV) are strongly affected by the dissimilarity between the sample data and the prediction locations. In this paper, we propose a method to quantify such dissimilarity on a scale from 0 to 100%, from the perspective of the data feature space. The proposed method is based on adversarial validation, an approach that checks whether sample data and prediction locations can be separated with a binary classifier. To study the effectiveness and generality of our method, we tested it in a series of experiments based on both synthetic and real datasets with gradually increasing dissimilarities. Results show that the proposed method can successfully quantify dissimilarity across the entire range of values. In addition, we studied how dissimilarity affects CV evaluations by comparing the results of random CV with those of two spatial CV methods, namely block and spatial+ CV. Our results showed that CV evaluations follow similar patterns in all datasets and predictions: when dissimilarity is low (usually below 30%), random CV provides the most accurate evaluation results. As dissimilarity increases, spatial CV methods, especially spatial+ CV, become more and more accurate and even outperform random CV. When dissimilarity is high (>=90%), no CV method provides accurate evaluations. These results show the importance of considering feature space dissimilarity when working with geospatial machine learning predictions, and can help researchers and practitioners select more suitable CV methods for evaluating their predictions.
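To make the idea concrete, here is a minimal adversarial-validation sketch, not the authors' exact implementation: label the sample data 0 and the prediction locations 1, train a binary classifier on the combined feature space, and rescale its cross-validated AUC to a 0-100% dissimilarity score. The random-forest classifier and the linear AUC-to-percentage mapping are illustrative assumptions; the paper may use a different classifier and mapping.

```python
# Adversarial validation sketch: if a classifier can tell sample data apart
# from prediction locations, the two occupy different regions of feature space.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def adversarial_dissimilarity(X_sample, X_pred, n_splits=5, seed=0):
    """Return a dissimilarity score in [0, 100] for two feature matrices."""
    X = np.vstack([X_sample, X_pred])
    y = np.concatenate([np.zeros(len(X_sample)), np.ones(len(X_pred))])
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    # Out-of-fold probabilities avoid the optimism of in-sample scoring.
    proba = cross_val_predict(clf, X, y, cv=n_splits, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, proba)
    # AUC ~ 0.5 -> indistinguishable (0%); AUC ~ 1.0 -> fully separable (100%).
    return max(0.0, 2.0 * (auc - 0.5)) * 100.0

# Synthetic example: a shifted feature distribution yields a high score.
rng = np.random.default_rng(0)
X_s = rng.normal(0.0, 1.0, size=(500, 4))
X_p = rng.normal(1.5, 1.0, size=(500, 4))
print(f"dissimilarity: {adversarial_dissimilarity(X_s, X_p):.1f}%")
```

When both matrices are drawn from the same distribution, the score stays near 0%, which is the regime in which the abstract reports random CV evaluating most accurately.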
Related papers
- Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We characterize the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting.
We validate our theory across a variety of high-dimensional data.
arXiv Detail & Related papers (2024-08-08T17:27:29Z)
- Predictive Performance Test based on the Exhaustive Nested Cross-Validation for High-dimensional data [7.62566998854384]
Cross-validation is used for several tasks such as estimating the prediction error, tuning the regularization parameter, and selecting the most suitable predictive model.
K-fold cross-validation is a popular CV method, but its limitation is that the risk estimates depend strongly on how the data are partitioned.
This study presents an alternative novel predictive performance test and valid confidence intervals based on exhaustive nested cross-validation.
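The partition dependence noted above is easy to demonstrate: re-running K-fold CV under different random splits of the same data yields visibly different risk estimates. The sketch below shows this spread on a synthetic high-dimensional regression problem; it illustrates the motivating effect only and is not the paper's nested test.

```python
# Illustrating partition dependence: the same model and data give different
# K-fold risk estimates under different random splits.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=120, n_features=200, noise=5.0, random_state=0)
estimates = []
for seed in range(50):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    estimates.append(-scores.mean())  # one K-fold MSE estimate per partition
print(f"MSE estimate: mean={np.mean(estimates):.1f}, "
      f"std across partitions={np.std(estimates):.1f}")
```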
arXiv Detail & Related papers (2024-08-06T12:28:16Z)
- Is K-fold cross validation the best model selection method for Machine Learning? [0.0]
K-fold cross-validation (CV) is the most common approach to ascertaining the likelihood that a machine learning outcome is generated by chance.
A novel statistical test based on K-fold CV and the Upper Bound of the actual risk (K-fold CUBV) is proposed.
arXiv Detail & Related papers (2024-01-29T18:46:53Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Confidence intervals for the Cox model test error from cross-validation [91.3755431537592]
Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model.
Standard confidence intervals for test error using estimates from CV may have coverage below nominal levels.
One way to address this issue is to instead estimate the mean squared error of the prediction error estimate using nested CV.
arXiv Detail & Related papers (2022-01-26T06:40:43Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
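A compact sketch of the ATC idea as summarized above, under the assumption that the threshold is set on held-out source data so that the low-confidence mass matches the source error rate. The function and variable names are hypothetical, and real usage would take confidences from an actual model rather than the simulated scores below.

```python
# ATC-style accuracy estimation: calibrate a confidence threshold on labeled
# source data, then count target points that clear it.
import numpy as np

def atc_estimate(conf_src, correct_src, conf_tgt):
    """conf_*: model confidence scores; correct_src: 0/1 correctness on source."""
    src_acc = correct_src.mean()
    # Choose t so the fraction of source points with confidence >= t equals
    # the source accuracy (low-confidence mass matches the error rate).
    t = np.quantile(conf_src, 1.0 - src_acc)
    return (conf_tgt >= t).mean()  # predicted target accuracy

# Hypothetical scores: the target set is harder, so confidences are lower.
rng = np.random.default_rng(1)
conf_src = rng.beta(8, 2, 2000)                        # confident on source
correct_src = (rng.random(2000) < conf_src).astype(float)  # roughly calibrated
conf_tgt = rng.beta(5, 3, 2000)                        # shifted target scores
print(f"estimated target accuracy: {atc_estimate(conf_src, correct_src, conf_tgt):.3f}")
```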
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Targeted Cross-Validation [23.689101487016266]
We propose a targeted cross-validation (TCV) to select models or procedures based on a general weighted $L_2$ loss.
We show that TCV is consistent in selecting the best performing candidate under the weighted $L_2$ loss.
In this work, we broaden the concept of selection consistency by allowing the best candidate to switch as the sample size varies.
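A minimal sketch of the weighted-loss idea behind TCV: score candidate models by a weighted squared error on held-out folds, where the weight function w(x) emphasizes the region of interest. The Gaussian weight and the candidate models below are illustrative choices, not the paper's procedure.

```python
# Model selection under a weighted L2 loss: the weight function concentrates
# the CV criterion on the region where predictions actually matter.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def weighted_cv_loss(model, X, y, w, n_splits=5):
    losses = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        m = model.fit(X[tr], y[tr])
        resid2 = (y[te] - m.predict(X[te])) ** 2
        losses.append(np.average(resid2, weights=w(X[te])))  # weighted L2 loss
    return np.mean(losses)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 400)
w = lambda Xq: np.exp(-Xq[:, 0] ** 2)  # emphasize accuracy near x = 0
for name, model in [("linear", LinearRegression()),
                    ("tree", DecisionTreeRegressor(max_depth=4, random_state=0))]:
    print(name, f"weighted CV loss: {weighted_cv_loss(model, X, y, w):.4f}")
```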
arXiv Detail & Related papers (2021-09-14T19:53:18Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Semi-Automatic Data Annotation guided by Feature Space Projection [117.9296191012968]
We present a semi-automatic data annotation approach based on suitable feature space projection and semi-supervised label estimation.
We validate our method on the popular MNIST dataset and on images of human intestinal parasites with and without fecal impurities.
Our results demonstrate the added-value of visual analytics tools that combine complementary abilities of humans and machines for more effective machine learning.
arXiv Detail & Related papers (2020-07-27T17:03:50Z)
- Approximate Cross-Validation for Structured Models [20.79997929155929]
The gold-standard evaluation technique is structured cross-validation (CV).
But CV here can be prohibitively slow due to the need to re-run already-expensive learning algorithms many times.
Previous work has shown that approximate cross-validation (ACV) methods provide a fast and provably accurate alternative.
arXiv Detail & Related papers (2020-06-23T00:06:03Z)
- Estimating the Prediction Performance of Spatial Models via Spatial k-Fold Cross Validation [1.7205106391379026]
In machine learning, one often assumes the data are independent when evaluating model performance.
Spatial autocorrelation (SAC) causes standard cross-validation (CV) methods to produce optimistically biased prediction performance estimates.
We propose a modified version of the CV method called spatial k-fold cross validation (SKCV) which provides a useful estimate for model prediction performance without optimistic bias due to SAC.
arXiv Detail & Related papers (2020-05-28T19:55:18Z)
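A hedged sketch in the spirit of SKCV: hold out whole spatial blocks so that test points are not surrounded by nearby, autocorrelated training points. Grid-based blocking with GroupKFold is one common variant; the paper's SKCV may differ in detail (e.g., using a distance buffer around the test points, which this sketch omits). The data and model below are illustrative.

```python
# Spatial block CV: holding out whole grid cells keeps autocorrelated
# neighbours of a test point out of the training folds.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
coords = rng.uniform(0, 100, size=(600, 2))             # point locations
X = np.hstack([coords, rng.normal(size=(600, 3))])      # features incl. location
y = 0.05 * coords.sum(axis=1) + rng.normal(0, 1, 600)   # spatially smooth target

# Spatial blocks: a coarse grid over the study area; each cell is one group.
cell = 25.0
groups = (coords[:, 0] // cell) * 10 + (coords[:, 1] // cell)

errors = []
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=groups):
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[tr], y[tr])
    errors.append(mean_squared_error(y[te], model.predict(X[te])))
print(f"spatial-CV MSE: {np.mean(errors):.3f}")
```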
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.