Holistic Robust Data-Driven Decisions
- URL: http://arxiv.org/abs/2207.09560v3
- Date: Wed, 16 Aug 2023 22:24:53 GMT
- Title: Holistic Robust Data-Driven Decisions
- Authors: Amine Bennouna and Bart Van Parys
- Abstract summary: Practical overfitting typically cannot be attributed to a single cause; instead, several factors act all at once.
We consider here three overfitting sources: (i) statistical error as a result of working with finite sample data, (ii) data noise, which occurs when the data points are measured only with finite precision, and (iii) data misspecification, in which a small fraction of all data may be wholly corrupted.
We argue that although existing data-driven formulations may be robust against one of these three sources in isolation, they do not provide holistic protection against all overfitting sources simultaneously.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The design of data-driven formulations for machine learning and
decision-making with good out-of-sample performance is a key challenge. The
observation that good in-sample performance does not guarantee good
out-of-sample performance is generally known as overfitting. Practical
overfitting typically cannot be attributed to a single cause; instead, it is
caused by several factors all at once. We consider here three overfitting
sources: (i) statistical error as a result of working with finite sample data,
(ii) data noise, which occurs when the data points are measured only with
finite precision, and finally (iii) data misspecification, in which a small
fraction of all data may be wholly corrupted. We argue that although existing
data-driven formulations may be robust against one of these three sources in
isolation, they do not provide holistic protection against all overfitting
sources simultaneously. We design a novel data-driven formulation which does
guarantee such holistic protection and is furthermore computationally viable.
Our distributionally robust optimization formulation can be interpreted as a
novel combination of a Kullback-Leibler and a Lévy-Prokhorov robust
optimization formulation, a combination which is of interest in its own right.
Moreover, we show that in the context of classification and regression
problems, several popular regularized and robust formulations reduce to
particular cases of our proposed formulation. Finally, we apply the proposed
holistic robust (HR) formulation to a portfolio selection problem with real
stock data and analyze its risk/return tradeoff against several benchmark
formulations. Our experiments show that our novel ambiguity set provides a
significantly better risk/return trade-off.
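To make the combination concrete, the HR ambiguity set can be sketched as a two-stage perturbation of the empirical distribution: a Lévy-Prokhorov step absorbing data noise and corruption, followed by a Kullback-Leibler step absorbing statistical error. The sketch below is a schematic reading of the abstract rather than the paper's exact definition; the radii epsilon (noise magnitude), alpha (corrupted fraction), and r (statistical error) are assumed notation.

```latex
% Schematic HR ambiguity set around the empirical distribution \hat{P}_n:
% LP_epsilon: Levy-Prokhorov-type ball capturing measurement noise of
%   magnitude epsilon together with a corrupted fraction alpha of the data;
% KL: Kullback-Leibler ball of radius r capturing statistical error.
\[
  \mathcal{U}_{\epsilon,\alpha,r}(\hat{P}_n)
  = \Bigl\{ P \;:\; \exists\, P' \ \text{such that}\
      \mathrm{LP}_{\epsilon}\bigl(\hat{P}_n, P'\bigr) \le \alpha
      \ \text{and}\ \mathrm{KL}\bigl(P' \,\Vert\, P\bigr) \le r \Bigr\}
\]
% The holistic robust decision then hedges against every distribution
% in this set:
\[
  \min_{x \in \mathcal{X}} \;
  \sup_{P \in \mathcal{U}_{\epsilon,\alpha,r}(\hat{P}_n)}
  \mathbb{E}_{P}\bigl[\ell(x,\xi)\bigr]
\]
```

In this sketch, setting alpha and epsilon to zero leaves a pure Kullback-Leibler formulation, while r = 0 leaves a pure Lévy-Prokhorov one, consistent with the claim that popular regularized and robust formulations arise as particular cases.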
Related papers
- Learning from Noisy Labels via Conditional Distributionally Robust Optimization [5.85767711644773]
Crowdsourcing has emerged as a practical solution for labeling large datasets.
However, noisy labels from annotators with varying levels of expertise pose a significant challenge to learning accurate models.
arXiv Detail & Related papers (2024-11-26T05:03:26Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z)
- Non-Convex Robust Hypothesis Testing using Sinkhorn Uncertainty Sets [18.46110328123008]
We present a new framework to address the non-convex robust hypothesis testing problem.
The goal is to seek the optimal detector that minimizes the maximum of the worst-case type-I and type-II risks.
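Schematically, the minimax detector problem described above takes the form below; the Sinkhorn-ball ambiguity sets \mathcal{P}_0 and \mathcal{P}_1 around the empirical distributions of the two hypotheses and the randomized detector \phi are assumed notation, not necessarily the paper's exact formulation.

```latex
% Robust detector: minimize the larger of the two worst-case error
% probabilities over Sinkhorn-distance ambiguity sets P_0 and P_1.
\[
  \min_{\phi:\,\Omega \to [0,1]} \;
  \max\Bigl\{
    \sup_{P \in \mathcal{P}_0} \mathbb{E}_{P}\bigl[\phi(\xi)\bigr],\;
    \sup_{Q \in \mathcal{P}_1} \mathbb{E}_{Q}\bigl[1 - \phi(\xi)\bigr]
  \Bigr\}
\]
```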
arXiv Detail & Related papers (2024-03-21T20:29:43Z)
- The Decaying Missing-at-Random Framework: Doubly Robust Causal Inference with Partially Labeled Data [10.021381302215062]
In real-world scenarios, data collection limitations often result in partially labeled datasets, leading to difficulties in drawing reliable causal inferences.
Traditional approaches in the semi-supervised (SS) and missing data literature may not adequately handle these complexities, leading to biased estimates.
The proposed decaying missing-at-random (MAR) framework tackles missing outcomes in high-dimensional settings and accounts for selection bias.
arXiv Detail & Related papers (2023-05-22T07:37:12Z)
- Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation [59.500347564280204]
We propose a new Aleatoric Uncertainty-aware Recommendation (AUR) framework.
AUR consists of a new uncertainty estimator along with a normal recommender model.
Since the chance of mislabeling reflects the potential of a user-item pair, AUR makes recommendations according to the estimated uncertainty.
arXiv Detail & Related papers (2022-09-22T04:32:51Z)
- Distributionally robust risk evaluation with a causality constraint and structural information [0.0]
We approximate test functions by neural networks and establish sample complexity bounds via Rademacher complexity.
Our framework outperforms the classic counterparts in the distributionally robust portfolio selection problem.
arXiv Detail & Related papers (2022-03-20T14:48:37Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples whose confidence exceeds the threshold.
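A minimal sketch of the ATC idea as described above: pick a confidence threshold on labeled source data, then predict target accuracy as the fraction of unlabeled target examples whose confidence exceeds it. Function names and the calibration rule are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def atc_threshold(source_conf: np.ndarray, source_correct: np.ndarray) -> float:
    """Choose a threshold t on labeled source data so that the fraction of
    source examples with confidence > t matches the observed source accuracy.
    (Illustrative calibration rule; the paper studies variants.)"""
    target_rate = source_correct.mean()  # observed source accuracy
    # Pick the quantile whose exceedance rate equals the source accuracy.
    return float(np.quantile(source_conf, 1.0 - target_rate))

def atc_predict_accuracy(target_conf: np.ndarray, threshold: float) -> float:
    """Predicted target accuracy = fraction of unlabeled target examples
    whose model confidence exceeds the learned threshold."""
    return float((target_conf > threshold).mean())

# Usage with toy numbers: confidences in [0, 1] and 0/1 correctness flags.
rng = np.random.default_rng(0)
src_conf = rng.uniform(0.5, 1.0, size=1000)
src_correct = (rng.uniform(size=1000) < src_conf).astype(float)
t = atc_threshold(src_conf, src_correct)
tgt_conf = rng.uniform(0.4, 1.0, size=1000)
print(f"threshold={t:.3f}, "
      f"predicted target accuracy={atc_predict_accuracy(tgt_conf, t):.3f}")
```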
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Learning and Decision-Making with Data: Optimal Formulations and Phase Transitions [0.0]
We study the problem of designing optimal learning and decision-making formulations when only historical data is available.
We show the existence of three distinct out-of-sample performance regimes.
arXiv Detail & Related papers (2021-09-14T18:20:15Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
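A minimal sketch of the importance-sampling-weighted ERM idea described above: each logged sample is reweighted by the inverse of the (known) propensity with which it was collected, so the weighted average is unbiased for the risk under non-adaptive collection. The propensity array and squared loss are illustrative assumptions.

```python
import numpy as np

def weighted_erm_loss(preds: np.ndarray, targets: np.ndarray,
                      propensities: np.ndarray) -> float:
    """Importance-weighted empirical risk: each adaptively collected sample
    is reweighted by 1/propensity so the average is unbiased for the risk
    under uniform (non-adaptive) data collection."""
    weights = 1.0 / propensities     # inverse-propensity weights
    losses = (preds - targets) ** 2  # illustrative squared loss
    return float(np.mean(weights * losses))

# Usage: data logged by an adaptive policy that favored some samples.
rng = np.random.default_rng(1)
propensities = rng.uniform(0.1, 0.9, size=500)  # known logging probabilities
targets = rng.normal(size=500)
preds = targets + rng.normal(scale=0.5, size=500)
print(f"importance-weighted risk: "
      f"{weighted_erm_loss(preds, targets, propensities):.3f}")
```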
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.