Active Deep Learning on Entity Resolution by Risk Sampling
- URL: http://arxiv.org/abs/2012.12960v1
- Date: Wed, 23 Dec 2020 20:38:25 GMT
- Title: Active Deep Learning on Entity Resolution by Risk Sampling
- Authors: Youcef Nafa, Qun Chen, Zhaoqiang Chen, Xingyu Lu, Haiyang He, Tianyi
Duan and Zhanhuai Li
- Abstract summary: Active Learning (AL) presents itself as a feasible solution that focuses on data deemed useful for model training.
We propose a novel AL approach of risk sampling for entity resolution (ER).
Based on the core-set characterization for AL, we theoretically derive an optimization model which aims to minimize core-set loss with non-uniform Lipschitz continuity.
We empirically verify the efficacy of the proposed approach on real data by a comparative study.
- Score: 5.219701379581547
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While the state-of-the-art performance on entity resolution (ER) has been
achieved by deep learning, its effectiveness depends on large quantities of
accurately labeled training data. To alleviate the data labeling burden, Active
Learning (AL) presents itself as a feasible solution that focuses on data
deemed useful for model training. Building upon the recent advances in risk
analysis for ER, which can provide a more refined estimate on label
misprediction risk than the simpler classifier outputs, we propose a novel AL
approach of risk sampling for ER. Risk sampling leverages misprediction risk
estimation for active instance selection. Based on the core-set
characterization for AL, we theoretically derive an optimization model which
aims to minimize core-set loss with non-uniform Lipschitz continuity. Since the
defined weighted K-medoids problem is NP-hard, we then present an efficient
heuristic algorithm. Finally, we empirically verify the efficacy of the
proposed approach on real data by a comparative study. Our extensive
experiments have shown that it outperforms the existing alternatives by
considerable margins. Using ER as a test case, we demonstrate that risk
sampling is a promising approach potentially applicable to other challenging
classification tasks.
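The selection step described in the abstract, minimizing a risk-weighted core-set loss cast as a weighted K-medoids problem and solved heuristically, can be illustrated with a short sketch. The function name, the greedy swap heuristic, the use of Euclidean distances on instance representations, and the form of the risk weights below are all illustrative assumptions; the paper's exact formulation and algorithm may differ.

```python
import numpy as np

def weighted_k_medoids(X, risks, k, n_iter=50, seed=0):
    """Greedy swap heuristic for a risk-weighted K-medoids objective.

    Selects k medoids (instances to query for labels) that minimize the
    sum over all points of the risk-weighted distance to the nearest
    medoid. Illustrative sketch only, not the authors' exact algorithm.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distances between instance representations.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)

    def cost(meds):
        # Risk-weighted core-set loss: each point's distance to its
        # nearest medoid, weighted by its misprediction risk.
        return float((risks * dist[:, meds].min(axis=1)).sum())

    best = cost(medoids)
    for _ in range(n_iter):
        improved = False
        for i in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand  # try swapping medoid i for candidate
                c = cost(trial)
                if c < best:
                    best, medoids, improved = c, trial, True
        if not improved:  # local optimum under single swaps
            break
    return np.sort(medoids), best
```

Since the weighted K-medoids problem is NP-hard, the swap loop only guarantees a local optimum under single exchanges; higher misprediction risk pulls medoids toward regions the current model is likely to get wrong.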
Related papers
- Progressive Generalization Risk Reduction for Data-Efficient Causal Effect Estimation [30.49865329385806]
Causal effect estimation (CEE) provides a crucial tool for predicting the unobserved counterfactual outcome for an entity.
In this paper, we study a more realistic CEE setting where the labelled data samples are scarce at the beginning.
We propose the Model Agnostic Causal Active Learning (MACAL) algorithm for batch-wise label acquisition.
arXiv Detail & Related papers (2024-11-18T03:17:40Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement with the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- Optimal Sample Selection Through Uncertainty Estimation and Its Application in Deep Learning [22.410220040736235]
We present a theoretically optimal solution for addressing both coreset selection and active learning.
Our proposed method, COPS, is designed to minimize the expected loss of a model trained on subsampled data.
arXiv Detail & Related papers (2023-09-05T14:06:33Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when the unlabeled sample is believed to incorporate high loss.
Our approach achieves superior performance compared to state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
- Mitigating sampling bias in risk-based active learning via an EM algorithm [0.0]
Risk-based active learning is an approach to developing statistical classifiers for online decision-support.
Data-label querying is guided according to the expected value of perfect information for incipient data points.
A semi-supervised approach counteracts sampling bias by incorporating pseudo-labels for unlabelled data via an EM algorithm.
arXiv Detail & Related papers (2022-06-25T08:48:25Z)
- Self-Certifying Classification by Linearized Deep Assignment [65.0100925582087]
We propose a novel class of deep predictors for classifying metric data on graphs within the PAC-Bayes risk certification paradigm.
Building on the recent PAC-Bayes literature and data-dependent priors, this approach enables learning posterior distributions on the hypothesis space.
arXiv Detail & Related papers (2022-01-26T19:59:14Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Adaptive Deep Learning for Entity Resolution by Risk Analysis [5.496296462160264]
This paper proposes a novel risk-based approach to tune a deep model towards a target workload by its particular characteristics.
Our theoretical analysis shows that risk-based adaptive training can correct the label status of a mispredicted instance with a fairly good chance.
arXiv Detail & Related papers (2020-12-07T08:05:46Z)
- SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations.
We provide intuition as to the effectiveness of the framework by a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.