Active Deep Learning on Entity Resolution by Risk Sampling
- URL: http://arxiv.org/abs/2012.12960v1
- Date: Wed, 23 Dec 2020 20:38:25 GMT
- Title: Active Deep Learning on Entity Resolution by Risk Sampling
- Authors: Youcef Nafa, Qun Chen, Zhaoqiang Chen, Xingyu Lu, Haiyang He, Tianyi
Duan and Zhanhuai Li
- Abstract summary: Active Learning (AL) presents itself as a feasible solution that focuses on data deemed useful for model training.
We propose a novel AL approach of risk sampling for entity resolution (ER).
Based on the core-set characterization for AL, we theoretically derive an optimization model which aims to minimize core-set loss with non-uniform Lipschitz continuity.
We empirically verify the efficacy of the proposed approach on real data by a comparative study.
- Score: 5.219701379581547
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While the state-of-the-art performance on entity resolution (ER) has been
achieved by deep learning, its effectiveness depends on large quantities of
accurately labeled training data. To alleviate the data labeling burden, Active
Learning (AL) presents itself as a feasible solution that focuses on data
deemed useful for model training. Building upon the recent advances in risk
analysis for ER, which can provide a more refined estimate on label
misprediction risk than the simpler classifier outputs, we propose a novel AL
approach of risk sampling for ER. Risk sampling leverages misprediction risk
estimation for active instance selection. Based on the core-set
characterization for AL, we theoretically derive an optimization model which
aims to minimize core-set loss with non-uniform Lipschitz continuity. Since the
defined weighted K-medoids problem is NP-hard, we then present an efficient
heuristic algorithm. Finally, we empirically verify the efficacy of the
proposed approach on real data by a comparative study. Our extensive
experiments have shown that it outperforms the existing alternatives by
considerable margins. Using ER as a test case, we demonstrate that risk
sampling is a promising approach potentially applicable to other challenging
classification tasks.
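The selection step described in the abstract, minimizing a risk-weighted core-set loss cast as a weighted K-medoids problem and solved heuristically, can be illustrated with a short sketch. The function name, the greedy swap heuristic, the use of Euclidean distances on instance representations, and the form of the risk weights below are all illustrative assumptions; the paper's exact formulation and algorithm may differ.

```python
import numpy as np

def weighted_k_medoids(X, risks, k, n_iter=50, seed=0):
    """Greedy swap heuristic for a risk-weighted K-medoids objective.

    Selects k medoids (instances to query for labels) that minimize the
    sum over all points of the risk-weighted distance to the nearest
    medoid. Illustrative sketch only, not the authors' exact algorithm.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distances between instance representations.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)

    def cost(meds):
        # Risk-weighted core-set loss: each point's distance to its
        # nearest medoid, weighted by its misprediction risk.
        return float((risks * dist[:, meds].min(axis=1)).sum())

    best = cost(medoids)
    for _ in range(n_iter):
        improved = False
        for i in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand  # try swapping medoid i for candidate
                c = cost(trial)
                if c < best:
                    best, medoids, improved = c, trial, True
        if not improved:  # local optimum under single swaps
            break
    return np.sort(medoids), best
```

Since the weighted K-medoids problem is NP-hard, the swap loop only guarantees a local optimum under single exchanges; higher misprediction risk pulls medoids toward regions the current model is likely to get wrong.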
Related papers
- Progressive Generalization Risk Reduction for Data-Efficient Causal Effect Estimation [30.49865329385806]
Causal effect estimation (CEE) provides a crucial tool for predicting the unobserved counterfactual outcome for an entity.
In this paper, we study a more realistic CEE setting where the labelled data samples are scarce at the beginning.
We propose the Model Agnostic Causal Active Learning (MACAL) algorithm for batch-wise label acquisition.
arXiv Detail & Related papers (2024-11-18T03:17:40Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement with the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- Optimal Sample Selection Through Uncertainty Estimation and Its Application in Deep Learning [22.410220040736235]
We present a theoretically optimal solution for addressing both coreset selection and active learning.
Our proposed method, COPS, is designed to minimize the expected loss of a model trained on subsampled data.
arXiv Detail & Related papers (2023-09-05T14:06:33Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when the unlabeled sample is believed to incorporate high loss.
Our approach achieves superior performance compared to state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
- Mitigating sampling bias in risk-based active learning via an EM algorithm [0.0]
Risk-based active learning is an approach to developing statistical classifiers for online decision-support.
Data-label querying is guided according to the expected value of perfect information for incipient data points.
A semi-supervised approach counteracts sampling bias by incorporating pseudo-labels for unlabelled data via an EM algorithm.
arXiv Detail & Related papers (2022-06-25T08:48:25Z)
- Self-Certifying Classification by Linearized Deep Assignment [65.0100925582087]
We propose a novel class of deep predictors for classifying metric data on graphs within the PAC-Bayes risk certification paradigm.
Building on the recent PAC-Bayes literature and data-dependent priors, this approach enables learning posterior distributions on the hypothesis space.
arXiv Detail & Related papers (2022-01-26T19:59:14Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Adaptive Deep Learning for Entity Resolution by Risk Analysis [5.496296462160264]
This paper proposes a novel risk-based approach to tune a deep model towards a target workload by its particular characteristics.
Our theoretical analysis shows that risk-based adaptive training can correct the label status of a mispredicted instance with a fairly good chance.
arXiv Detail & Related papers (2020-12-07T08:05:46Z)
- SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations.
We provide intuition as to the effectiveness of the framework by a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.