Holistic Robust Data-Driven Decisions
- URL: http://arxiv.org/abs/2207.09560v3
- Date: Wed, 16 Aug 2023 22:24:53 GMT
- Title: Holistic Robust Data-Driven Decisions
- Authors: Amine Bennouna and Bart Van Parys
- Abstract summary: Practical overfitting can typically not be attributed to a single cause but instead is caused by several factors all at once.
We consider here three overfitting sources: (i) statistical error as a result of working with finite sample data, (ii) data noise which occurs when the data points are measured only with finite precision, and finally (iii) data misspecification in which a small fraction of all data may be wholly corrupted.
We argue that although existing data-driven formulations may be robust against one of these three sources in isolation they do not provide holistic protection against all overfitting sources simultaneously.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The design of data-driven formulations for machine learning and
decision-making with good out-of-sample performance is a key challenge. The
observation that good in-sample performance does not guarantee good
out-of-sample performance is generally known as overfitting. Practical
overfitting can typically not be attributed to a single cause but instead is
caused by several factors all at once. We consider here three overfitting
sources: (i) statistical error as a result of working with finite sample data,
(ii) data noise which occurs when the data points are measured only with finite
precision, and finally (iii) data misspecification in which a small fraction of
all data may be wholly corrupted. We argue that although existing data-driven
formulations may be robust against one of these three sources in isolation they
do not provide holistic protection against all overfitting sources
simultaneously. We design a novel data-driven formulation which does guarantee
such holistic protection and is furthermore computationally viable. Our
distributionally robust optimization formulation can be interpreted as a novel
combination of a Kullback-Leibler and Levy-Prokhorov robust optimization
formulation which is novel in its own right. However, we show that, in the
context of classification and regression problems, several popular
regularized and robust formulations reduce to a particular case of our proposed
novel formulation. Finally, we apply the proposed HR formulation to a portfolio
selection problem with real stock data, and analyze its risk/return trade-off
against several benchmark formulations. Our experiments show that our novel
ambiguity set provides a significantly better risk/return trade-off.
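For orientation, the two discrepancy measures named in the abstract have the standard textbook definitions given below. The last two lines are only an illustrative sketch of how a Levy-Prokhorov ball and a Kullback-Leibler ball might be composed into one ambiguity set and a worst-case cost. The symbols $\hat P_N$ (empirical distribution), $\alpha$, $r$, $\mathcal{U}$, $\ell$, and $\xi$ are assumptions introduced here, not notation confirmed by this page; the paper's exact construction, including the direction of the divergence and the precise form of the LP ball, should be taken from the paper itself.

```latex
\begin{align*}
% Kullback-Leibler divergence (standard definition, requires Q << P)
\mathrm{KL}(Q \,\|\, P) &= \int \log\!\left(\frac{\mathrm{d}Q}{\mathrm{d}P}\right) \mathrm{d}Q, \\
% Levy-Prokhorov metric (standard definition), with A^eps = { y : d(y, A) < eps }
\mathrm{LP}(P, Q) &= \inf\bigl\{\varepsilon > 0 :
    P(A) \le Q(A^{\varepsilon}) + \varepsilon \ \text{and}\
    Q(A) \le P(A^{\varepsilon}) + \varepsilon
    \ \text{for all Borel sets } A \bigr\}, \\
% Illustrative (assumed) composition of the two balls around the empirical distribution
\mathcal{U}_{\alpha, r}(\hat P_N) &= \bigl\{ P : \exists\, Q \ \text{such that}\
    \mathrm{LP}(\hat P_N, Q) \le \alpha, \ \mathrm{KL}(Q \,\|\, P) \le r \bigr\}, \\
% Worst-case expected loss of a decision x over the ambiguity set
\hat c(x) &= \sup_{P \in \mathcal{U}_{\alpha, r}(\hat P_N)} \mathbb{E}_{P}\bigl[\ell(x, \xi)\bigr].
\end{align*}
```

Read this way, the LP step would absorb small perturbations and a corrupted fraction of the data, while the KL step would account for finite-sample statistical error, matching the three overfitting sources listed in the abstract.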
Related papers
- A Conformal Approach to Feature-based Newsvendor under Model Misspecification [2.801095519296785]
We propose a model-free and distribution-free framework inspired by conformal prediction.
We validate our framework using both simulated data and a real-world dataset from the Capital Bikeshare program in Washington, D.C.
arXiv Detail & Related papers (2024-12-17T18:34:43Z)
- Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z)
- A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data [6.169163527464771]
This study proposes a crash data generation method based on Conditional Tabular GAN.
A crash severity model is employed to estimate the performance of classification and interpretation.
The results indicate that using synthetic data generated by CTGAN-RU for crash severity modeling outperforms original data or synthetic data generated by other resampling methods.
arXiv Detail & Related papers (2024-04-02T16:07:27Z)
- The Decaying Missing-at-Random Framework: Doubly Robust Causal Inference with Partially Labeled Data [10.021381302215062]
In real-world scenarios, data collection limitations often result in partially labeled datasets, leading to difficulties in drawing reliable causal inferences.
Traditional approaches in the semi-supervised (SS) and missing data literature may not adequately handle these complexities, leading to biased estimates.
This framework tackles missing outcomes in high-dimensional settings and accounts for selection bias.
arXiv Detail & Related papers (2023-05-22T07:37:12Z)
- Robust Direct Learning for Causal Data Fusion [14.462235940634969]
We provide a framework for integrating multi-source data that separates the treatment effect from other nuisance functions.
We also propose a causal information-aware weighting function motivated by theoretical insights from the semiparametric efficiency theory.
arXiv Detail & Related papers (2022-11-01T03:33:22Z)
- DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which the model's confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Learning and Decision-Making with Data: Optimal Formulations and Phase Transitions [0.0]
We study the problem of designing optimal learning and decision-making formulations when only historical data is available.
We show the existence of three distinct out-of-sample performance regimes.
arXiv Detail & Related papers (2021-09-14T18:20:15Z)
- Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.