Related papers: Holistic Robust Data-Driven Decisions

Holistic Robust Data-Driven Decisions

URL: http://arxiv.org/abs/2207.09560v4
Date: Sat, 01 Feb 2025 16:15:13 GMT
Title: Holistic Robust Data-Driven Decisions
Authors: Amine Bennouna, Bart Van Parys, Ryan Lucas,
Abstract summary: Practical overfitting can typically not be attributed to a single cause but is caused by several factors simultaneously.<n>We consider here three overfitting sources: (i) statistical error as a result of working with finite sample data, (ii) data noise, which occurs when the data points are measured only with finite precision, and finally, (iii) data misspecification in which a small fraction of all data may be wholly corrupted.<n>We design a novel data-driven formulation that guarantees such holistic protection and is computationally viable.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The design of data-driven formulations for machine learning and decision-making with good out-of-sample performance is a key challenge. The observation that good in-sample performance does not guarantee good out-of-sample performance is generally known as overfitting. Practical overfitting can typically not be attributed to a single cause but is caused by several factors simultaneously. We consider here three overfitting sources: (i) statistical error as a result of working with finite sample data, (ii) data noise, which occurs when the data points are measured only with finite precision, and finally, (iii) data misspecification in which a small fraction of all data may be wholly corrupted. Although existing data-driven formulations may be robust against one of these three sources in isolation, they do not provide holistic protection against all overfitting sources simultaneously. We design a novel data-driven formulation that guarantees such holistic protection and is computationally viable. Our distributionally robust optimization formulation can be interpreted as a novel combination of a Kullback-Leibler and L\'evy-Prokhorov robust optimization formulation. In the context of classification and regression problems, we show that several popular regularized and robust formulations naturally reduce to a particular case of our proposed novel formulation. Finally, we apply the proposed HR formulation to two real-life applications and study it alongside several benchmarks: (1) training neural networks on healthcare data, where we analyze various robustness and generalization properties in the presence of noise, labeling errors, and scarce data, (2) a portfolio selection problem with real stock data, and analyze the risk/return tradeoff under the natural severe distribution shift of the application.

Related papers

From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching [11.158961763380278]
Recent alternatives improve robustness by leveraging test-time data, but such data may be unavailable in practice.<n>We take a data-centric approach by leveraging invariant data pairs and noisy counterfactual matching.<n>We validate on a synthetic dataset and demonstrate on real-world benchmarks that linear probing on a pretrained backbone improves robustness.
arXiv Detail & Related papers (2025-05-30T17:42:32Z)
Causal Lifting of Neural Representations: Zero-Shot Generalization for Causal Inferences [56.23412698865433]
We focus on Prediction-Powered Causal Inferences (PPCI)<n> PPCI estimates the treatment effect in a target experiment with unlabeled factual outcomes, retrievable zero-shot from a pre-trained model.<n>We validate our method on synthetic and real-world scientific data, offering solutions to instances not solvable by vanilla Empirical Risk Minimization.
arXiv Detail & Related papers (2025-02-10T10:52:17Z)
A Conformal Approach to Feature-based Newsvendor under Model Misspecification [2.801095519296785]
We propose a model-free and distribution-free framework inspired by conformal prediction. We validate our framework using both simulated data and a real-world dataset from the Capital Bikeshare program in Washington, D.C.
arXiv Detail & Related papers (2024-12-17T18:34:43Z)
Learning from Noisy Labels via Conditional Distributionally Robust Optimization [5.85767711644773]
crowdsourcing has emerged as a practical solution for labeling large datasets. It presents a significant challenge in learning accurate models due to noisy labels from annotators with varying levels of expertise.
arXiv Detail & Related papers (2024-11-26T05:03:26Z)
A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset. We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information. We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z)
A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data [6.169163527464771]
This study proposes a crash data generation method based on Conditional Tabular GAN. A crash severity model is employed to estimate the performance of classification and interpretation. The results indicate that using synthetic data generated by CTGAN-RU for crash severity modeling outperforms original data or synthetic data generated by other resampling methods.
arXiv Detail & Related papers (2024-04-02T16:07:27Z)
Non-Convex Robust Hypothesis Testing using Sinkhorn Uncertainty Sets [18.46110328123008]
We present a new framework to address the non-robust hypothesis testing problem. The goal is to seek the optimal detector that minimizes the maximum numerical risk.
arXiv Detail & Related papers (2024-03-21T20:29:43Z)
The Decaying Missing-at-Random Framework: Doubly Robust Causal Inference with Partially Labeled Data [10.021381302215062]
In real-world scenarios, data collection limitations often result in partially labeled datasets, leading to difficulties in drawing reliable causal inferences. Traditional approaches in the semi-parametric (SS) and missing data literature may not adequately handle these complexities, leading to biased estimates. This framework tackles missing outcomes in high-dimensional settings and accounts for selection bias.
arXiv Detail & Related papers (2023-05-22T07:37:12Z)
Robust Direct Learning for Causal Data Fusion [14.462235940634969]
We provide a framework for integrating multi-source data that separates the treatment effect from other nuisance functions. We also propose a causal information-aware weighting function motivated by theoretical insights from the semiparametric efficiency theory.
arXiv Detail & Related papers (2022-11-01T03:33:22Z)
Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation [59.500347564280204]
We propose a new Aleatoric Uncertainty-aware Recommendation (AUR) framework. AUR consists of a new uncertainty estimator along with a normal recommender model. As the chance of mislabeling reflects the potential of a pair, AUR makes recommendations according to the uncertainty.
arXiv Detail & Related papers (2022-09-22T04:32:51Z)
DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data. We propose a general framework to solve the above two challenges simultaneously. We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z)
Distributionally robust risk evaluation with a causality constraint and structural information [0.0]
We approximate test functions by neural networks and prove the sample complexity with Rademacher complexity. Our framework outperforms the classic counterparts in the distributionally robust portfolio selection problem.
arXiv Detail & Related papers (2022-03-20T14:48:37Z)
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
Learning and Decision-Making with Data: Optimal Formulations and Phase Transitions [0.0]
We study the problem of designing optimal learning and decision-making formulations when only historical data is available. We show the existence of three distinct out-of-sample performance regimes.
arXiv Detail & Related papers (2021-09-14T18:20:15Z)
Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data. We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class. For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning. These measures should account for the wide variety of models used in practice. The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z)
Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model. The objective is to endow the trained model with robustness against adversarially manipulated input data. Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation. We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.