Learning under Selective Labels with Data from Heterogeneous
Decision-makers: An Instrumental Variable Approach
- URL: http://arxiv.org/abs/2306.07566v2
- Date: Sat, 24 Jun 2023 01:20:39 GMT
- Title: Learning under Selective Labels with Data from Heterogeneous
Decision-makers: An Instrumental Variable Approach
- Authors: Jian Chen, Zhehao Li, Xiaojie Mao
- Abstract summary: We study the problem of learning with selectively labeled data, which arises when outcomes are only partially labeled due to historical decision-making.
We propose a weighted learning approach that learns prediction rules robust to the label selection bias in both identification settings.
- Score: 7.629248625993988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of learning with selectively labeled data, which arises
when outcomes are only partially labeled due to historical decision-making. The
labeled data distribution may substantially differ from the full population,
especially when the historical decisions and the target outcome can be
simultaneously affected by some unobserved factors. Consequently, learning with
only the labeled data may lead to severely biased results when deployed to the
full population. Our paper tackles this challenge by exploiting the fact that
in many applications the historical decisions were made by a set of
heterogeneous decision-makers. In particular, we analyze this setup in a
principled instrumental variable (IV) framework. We establish conditions for
the full-population risk of any given prediction rule to be point-identified
from the observed data and provide sharp risk bounds when the point
identification fails. We further propose a weighted learning approach that
learns prediction rules robust to the label selection bias in both
identification settings. Finally, we apply our proposed approach to a
semi-synthetic financial dataset and demonstrate its superior performance in
the presence of selection bias.
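The weighted-learning idea can be illustrated with a toy sketch (the data, the propensity model, and the use of the true labeling propensity are illustrative assumptions, not the paper's implementation): when heterogeneous decision-makers label cases with different propensities, labeled examples can be reweighted by the inverse of their labeling probability so that the weighted labeled sample approximates the full population.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6000

# Toy data (illustrative only): each case is routed to one of three
# decision-makers whose assignment plays the role of the instrument.
dm = rng.integers(0, 3, size=n)                  # decision-maker id
x = rng.normal(size=n)                           # observed feature
y = (x + 0.5 * rng.normal(size=n)) > 0           # binary outcome

# Labeling propensity depends on the decision-maker AND the case itself,
# so the labeled subsample is not representative of the full population.
lenience = np.array([-1.0, 0.0, 1.0])            # per-decision-maker shift
p = 1.0 / (1.0 + np.exp(-(lenience[dm] + x)))    # true labeling propensity
labeled = rng.random(n) < p

# Risk of a fixed prediction rule f(x) = 1{x > 0}.
pred = x > 0
full_risk = np.mean(pred != y)                   # target: full-population risk

# Naive estimate uses labeled cases only -- biased under selective labels.
naive_risk = np.mean(pred[labeled] != y[labeled])

# Weighted estimate reweights labeled cases by inverse labeling propensity
# (the true propensity is used here for clarity; in practice it is estimated).
w = 1.0 / p[labeled]
weighted_risk = np.average(pred[labeled] != y[labeled], weights=w)

print(f"full-population risk: {full_risk:.3f}")
print(f"weighted estimate:    {weighted_risk:.3f}")
print(f"naive estimate:       {naive_risk:.3f}")
```

With a large enough sample, the weighted estimate tracks the full-population risk while the naive labeled-only estimate does not, which is the bias the paper's approach is designed to remove.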
Related papers
- Detecting and Identifying Selection Structure in Sequential Data [53.24493902162797]
We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences.
We show that selection structure is identifiable without any parametric assumptions or interventional experiments.
We also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies.
arXiv Detail & Related papers (2024-06-29T20:56:34Z)
- Learning with Complementary Labels Revisited: The Selected-Completely-at-Random Setting Is More Practical [66.57396042747706]
Complementary-label learning is a weakly supervised learning problem.
We propose a consistent approach that does not rely on the uniform distribution assumption.
We find that complementary-label learning can be expressed as a set of negative-unlabeled binary classification problems.
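The stated reduction can be sketched as follows (the helper name and the `-1`/`0` encoding are illustrative assumptions, not the paper's code): a complementary label names a class the example does *not* belong to, so for each class the examples carrying that complementary label are known negatives and all others are unlabeled.

```python
# Hypothetical helper: turn complementary labels into K negative-unlabeled
# binary datasets (0 = known negative for that class, -1 = unlabeled).
def to_negative_unlabeled(comp_labels, num_classes):
    datasets = []
    for k in range(num_classes):
        flags = [0 if c == k else -1 for c in comp_labels]
        datasets.append(flags)
    return datasets

comp = [2, 0, 1, 2]  # each entry: a class the example does NOT belong to
nu = to_negative_unlabeled(comp, num_classes=3)
# nu[2] marks examples 0 and 3 as known negatives for class 2
```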
arXiv Detail & Related papers (2023-11-27T02:59:17Z)
- Probabilistic Test-Time Generalization by Variational Neighbor-Labeling [62.158807685159736]
This paper strives for domain generalization, where models are trained exclusively on source domains before being deployed on unseen target domains.
The method probabilistically pseudo-labels target samples to generalize the source-trained model to the target domain at test time.
It introduces variational neighbor labels that incorporate information from neighboring target samples to generate more robust pseudo labels.
arXiv Detail & Related papers (2023-07-08T18:58:08Z)
- Statistical Inference Under Constrained Selection Bias [20.862583584531322]
We propose a framework that enables statistical inference in the presence of selection bias.
The output is high-probability bounds on the value of an estimand for the target distribution.
We analyze the computational and statistical properties of methods to estimate these bounds and show that our method can produce informative bounds on a variety of simulated and semisynthetic tasks.
arXiv Detail & Related papers (2023-06-05T23:05:26Z)
- Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z)
- Ex-Ante Assessment of Discrimination in Dataset [20.574371560492494]
Data owners face increasing liability for how the use of their data could harm under-privileged communities.
We propose FORESEE, a FORESt of decision trEEs algorithm, which generates a score that captures how likely an individual's response varies with sensitive attributes.
arXiv Detail & Related papers (2022-08-16T19:28:22Z)
- Bounding Counterfactuals under Selection Bias [60.55840896782637]
We propose the first algorithm to address both identifiable and unidentifiable queries.
We prove that, in spite of the missingness induced by the selection bias, the likelihood of the available data is unimodal.
arXiv Detail & Related papers (2022-07-26T10:33:10Z)
- Social Bias Meets Data Bias: The Impacts of Labeling and Measurement Errors on Fairness Criteria [4.048444203617942]
We consider two forms of dataset bias: errors by prior decision makers in the labeling process, and errors in measurement of the features of disadvantaged individuals.
We analytically show that some constraints can remain robust when facing certain statistical biases, while others (such as Equalized Odds) are significantly violated if trained on biased data.
Our findings present an additional guideline for choosing among existing fairness criteria, or for proposing new criteria, when available datasets may be biased.
arXiv Detail & Related papers (2022-05-31T22:43:09Z)
- On robust risk-based active-learning algorithms for enhanced decision support [0.0]
Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins.
The paper proposes two novel approaches to counteract the effects of sampling bias: semi-supervised learning and discriminative classification models.
arXiv Detail & Related papers (2022-01-07T17:25:41Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
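Random oversampling, one of the well-known strategies mentioned, can be sketched in a few lines (a generic illustration, not this paper's method; the toy data and helper name are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset (illustrative): 95 majority vs 5 minority examples.
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

def random_oversample(X, y, rng):
    """Duplicate minority-class examples (sampled with replacement)
    until every class has as many examples as the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, cnt in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=n_max - cnt, replace=True)
        idx.extend(members)
        idx.extend(extra)
    idx = np.array(idx)
    return X[idx], y[idx]

X_bal, y_bal = random_oversample(X, y, rng)
```

Undersampling works symmetrically, dropping majority-class examples instead of duplicating minority ones.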
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Dynamic Selection in Algorithmic Decision-making [9.172670955429906]
This paper identifies and addresses dynamic selection problems in online learning algorithms with endogenous data.
A novel bias (self-fulfilling bias) arises because the endogeneity of the data influences the choices of decisions.
We propose an instrumental-variable-based algorithm to correct for the bias.
arXiv Detail & Related papers (2021-08-28T01:41:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.