Learning with Selectively Labeled Data from Multiple Decision-makers
- URL: http://arxiv.org/abs/2306.07566v3
- Date: Mon, 24 Feb 2025 16:33:06 GMT
- Title: Learning with Selectively Labeled Data from Multiple Decision-makers
- Authors: Jian Chen, Zhehao Li, Xiaojie Mao
- Abstract summary: We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules.
- Score: 5.009970045326773
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules. We analyze this setup under a principled instrumental variable (IV) framework and rigorously study the identification of classification risk. We establish conditions for the exact identification of classification risk and derive tight partial identification bounds when exact identification fails. We further propose a unified cost-sensitive learning (UCL) approach to learn classifiers robust to selection bias in both identification settings. We further theoretically and numerically validate the efficacy of our proposed method.
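As a rough illustration of why multiple decision-makers help with partial identification, the sketch below computes textbook worst-case bounds on the marginal label rate P(Y=1) when labels are observed only for selected cases. It is not the paper's UCL method or its risk bounds. It assumes that cases are assigned to decision-makers independently of the true label, and the function name `label_rate_bounds` and the toy records are hypothetical.

```python
from collections import defaultdict


def label_rate_bounds(records):
    """Worst-case bounds on P(Y=1) from selectively labeled data.

    records: iterable of (z, d, y) tuples, where z is a decision-maker id,
    d is 1 if the label was observed and 0 otherwise, and y is the label
    (ignored when d == 0).

    Assumption (not taken from the paper): z is independent of y, so the
    bounds computed per decision-maker can be intersected.
    """
    # Per decision-maker: [total cases, observed positives, unlabeled cases]
    counts = defaultdict(lambda: [0, 0, 0])
    for z, d, y in records:
        c = counts[z]
        c[0] += 1
        if d == 1:
            c[1] += int(y == 1)
        else:
            c[2] += 1
    # Lower bound: every unlabeled case is negative.
    lowers = [c[1] / c[0] for c in counts.values()]
    # Upper bound: every unlabeled case is positive.
    uppers = [(c[1] + c[2]) / c[0] for c in counts.values()]
    return max(lowers), min(uppers)


# Hypothetical toy data: a strict and a lenient decision-maker, 10 cases each.
strict = [("strict", 1, 1)] * 3 + [("strict", 1, 0)] * 1 + [("strict", 0, None)] * 6
lenient = [("lenient", 1, 1)] * 5 + [("lenient", 1, 0)] * 3 + [("lenient", 0, None)] * 2
records = strict + lenient

# Bounds that use the decision-maker identity as an instrument.
bounds = label_rate_bounds(records)
# Bounds that ignore who made the decision (all cases pooled).
pooled = label_rate_bounds([("all", d, y) for _, d, y in records])

print(bounds, pooled)
```

On this toy data the pooled interval is [0.4, 0.8], while intersecting the per-decision-maker intervals narrows it to [0.5, 0.7]. The interval collapses to a point only if some decision-maker labels every case, which loosely mirrors the distinction between exact and partial identification drawn in the abstract.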
Related papers
- Probing Network Decisions: Capturing Uncertainties and Unveiling Vulnerabilities Without Label Information [19.50321703079894]
We present a novel framework to uncover the weakness of the classifier via counterfactual examples.
We test the performance of our prober's misclassification detection and verify its effectiveness on the image classification benchmark datasets.
arXiv Detail & Related papers (2025-03-12T05:05:58Z) - Detecting and Identifying Selection Structure in Sequential Data [53.24493902162797]
We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences.
We show that selection structure is identifiable without any parametric assumptions or interventional experiments.
We also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies.
arXiv Detail & Related papers (2024-06-29T20:56:34Z) - Learning with Complementary Labels Revisited: The Selected-Completely-at-Random Setting Is More Practical [66.57396042747706]
Complementary-label learning is a weakly supervised learning problem.
We propose a consistent approach that does not rely on the uniform distribution assumption.
We find that complementary-label learning can be expressed as a set of negative-unlabeled binary classification problems.
arXiv Detail & Related papers (2023-11-27T02:59:17Z) - Probabilistic Test-Time Generalization by Variational Neighbor-Labeling [62.158807685159736]
This paper strives for domain generalization, where models are trained exclusively on source domains before being deployed on unseen target domains.
It proposes probabilistic pseudo-labeling of target samples to generalize the source-trained model to the target domain at test time.
It also introduces variational neighbor labels that incorporate the information of neighboring target samples to generate more robust pseudo labels.
arXiv Detail & Related papers (2023-07-08T18:58:08Z) - A Universal Unbiased Method for Classification from Aggregate Observations [115.20235020903992]
This paper presents a novel universal method for classification from aggregate observations (CFAO), which provides an unbiased estimator of the classification risk for arbitrary losses.
Our proposed method not only guarantees the risk consistency due to the unbiased risk estimator but also can be compatible with arbitrary losses.
arXiv Detail & Related papers (2023-06-20T07:22:01Z) - Statistical Inference Under Constrained Selection Bias [20.862583584531322]
We propose a framework that enables statistical inference in the presence of selection bias.
The output is high-probability bounds on the value of an estimand for the target distribution.
We analyze the computational and statistical properties of methods to estimate these bounds and show that our method can produce informative bounds on a variety of simulated and semisynthetic tasks.
arXiv Detail & Related papers (2023-06-05T23:05:26Z) - Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z) - RISE: Robust Individualized Decision Learning with Sensitive Variables [1.5293427903448025]
A naive baseline is to ignore sensitive variables in learning decision rules, leading to significant uncertainty and bias.
We propose a decision learning framework to incorporate sensitive variables during offline training but not include them in the input of the learned decision rule during model deployment.
arXiv Detail & Related papers (2022-11-12T04:31:38Z) - Ex-Ante Assessment of Discrimination in Dataset [20.574371560492494]
Data owners face increasing liability for how the use of their data could harm underprivileged communities.
We propose FORESEE, a FORESt of decision trEEs algorithm, which generates a score that captures how likely an individual's response varies with sensitive attributes.
arXiv Detail & Related papers (2022-08-16T19:28:22Z) - Bounding Counterfactuals under Selection Bias [60.55840896782637]
We propose a first algorithm to address both identifiable and unidentifiable queries.
We prove that, in spite of the missingness induced by the selection bias, the likelihood of the available data is unimodal.
arXiv Detail & Related papers (2022-07-26T10:33:10Z) - Social Bias Meets Data Bias: The Impacts of Labeling and Measurement Errors on Fairness Criteria [4.048444203617942]
We consider two forms of dataset bias: errors by prior decision makers in the labeling process, and errors in measurement of the features of disadvantaged individuals.
We analytically show that some constraints can remain robust when facing certain statistical biases, while others (such as Equalized Odds) are significantly violated if trained on biased data.
Our findings present an additional guideline for choosing among existing fairness criteria, or for proposing new criteria, when available datasets may be biased.
arXiv Detail & Related papers (2022-05-31T22:43:09Z) - Determination of class-specific variables in nonparametric multiple-class classification [0.0]
We propose a probability-based nonparametric multiple-class classification method and integrate it with the ability to identify high-impact variables for each individual class.
We report the properties of the proposed method, and use both synthesized and real data sets to illustrate its properties under different classification situations.
arXiv Detail & Related papers (2022-05-07T10:08:58Z) - On robust risk-based active-learning algorithms for enhanced decision support [0.0]
Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins.
The paper proposes two novel approaches to counteract the effects of sampling bias: semi-supervised learning and discriminative classification models.
arXiv Detail & Related papers (2022-01-07T17:25:41Z) - Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Dynamic Selection in Algorithmic Decision-making [9.172670955429906]
This paper identifies and addresses dynamic selection problems in online learning algorithms with endogenous data.
A novel bias (self-fulfilling bias) arises because the endogeneity of the data influences the choices of decisions.
We propose an instrumental-variable-based algorithm to correct for the bias.
arXiv Detail & Related papers (2021-08-28T01:41:37Z) - Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z) - Minimax Active Learning [61.729667575374606]
Active learning aims to develop label-efficient algorithms by querying the most representative samples to be labeled by a human annotator.
Current active learning techniques either rely on model uncertainty to select the most uncertain samples or use clustering or reconstruction to choose the most diverse set of unlabeled examples.
We develop a semi-supervised minimax entropy-based active learning algorithm that leverages both uncertainty and diversity in an adversarial manner.
arXiv Detail & Related papers (2020-12-18T19:03:40Z) - SPL-MLL: Selecting Predictable Landmarks for Multi-Label Learning [87.27700889147144]
We propose to select a small subset of labels as landmarks that are easy to predict from the input (predictable) and can recover the other possible labels well (representative).
We employ the Alternating Direction Method (ADM) to solve our problem. Empirical studies on real-world datasets show that our method achieves superior classification performance over other state-of-the-art methods.
arXiv Detail & Related papers (2020-08-16T11:07:44Z) - Knowledge Distillation and Data Selection for Semi-Supervised Learning in CTC Acoustic Models [9.496916045581736]
Semi-supervised learning (SSL) is an active area of research which aims to utilize unlabelled data in order to improve the accuracy of speech recognition systems.
Our aim is to establish the importance of good criteria in selecting samples from a large pool of unlabelled data.
We perform empirical investigations of different data selection methods to answer this question and quantify the effect of different sampling strategies.
arXiv Detail & Related papers (2020-08-10T07:00:08Z) - An ensemble learning framework based on group decision making [7.906702226082627]
A framework for the ensemble learning (EL) method based on group decision making (GDM) has been proposed to resolve this issue.
In this framework, base learners can be considered as decision-makers, different categories can be seen as alternatives, and the precision, recall, and accuracy which can reflect the performances of the classification methods can be employed.
arXiv Detail & Related papers (2020-07-01T13:18:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.