Ground Truth Inference for Weakly Supervised Entity Matching
- URL: http://arxiv.org/abs/2211.06975v1
- Date: Sun, 13 Nov 2022 17:57:07 GMT
- Title: Ground Truth Inference for Weakly Supervised Entity Matching
- Authors: Renzhi Wu, Alexander Bendeck, Xu Chu, Yeye He
- Abstract summary: We propose a simple but powerful labeling model for weak supervision tasks.
We then tailor the labeling model specifically to the task of entity matching.
We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
- Score: 76.6732856489872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Entity matching (EM) refers to the problem of identifying pairs of data
records in one or more relational tables that refer to the same entity in the
real world. Supervised machine learning (ML) models currently achieve
state-of-the-art matching performance; however, they require many labeled
examples, which are often expensive or infeasible to obtain. This has inspired
us to approach data labeling for EM using weak supervision. In particular, we
use the labeling function abstraction popularized by Snorkel, where each
labeling function (LF) is a user-provided program that can generate many noisy
match/non-match labels quickly and cheaply. Given a set of user-written LFs,
the quality of data labeling depends on a labeling model to accurately infer
the ground-truth labels. In this work, we first propose a simple but powerful
labeling model for general weak supervision tasks. Then, we tailor the labeling
model specifically to the task of entity matching by considering the
EM-specific transitivity property.
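To make the LF abstraction concrete, here is a minimal sketch of what a user-written labeling function for EM might look like in the Snorkel style. The record-pair fields (`title_a`, `title_b`) and the similarity thresholds are hypothetical illustrations, not LFs from the paper.

```python
# Minimal sketch of a Snorkel-style labeling function for entity matching.
# The pair fields (title_a, title_b) and the thresholds are hypothetical.
from snorkel.labeling import labeling_function

ABSTAIN, NON_MATCH, MATCH = -1, 0, 1

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

@labeling_function()
def lf_title_overlap(pair):
    """Vote MATCH on high title similarity, NON_MATCH on low, else abstain."""
    sim = jaccard(pair.title_a, pair.title_b)
    if sim > 0.8:
        return MATCH
    if sim < 0.2:
        return NON_MATCH
    return ABSTAIN
```

Each such LF is cheap to write and noisy on its own; the labeling model's job is to aggregate many of these votes into a single inferred label per record pair.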
The general form of our labeling model is simple while substantially
outperforming the best existing method across ten general weak supervision
datasets. To tailor the labeling model for EM, we formulate an approach to
ensure that the final predictions of the labeling model satisfy the
transitivity property required in EM, utilizing an exact solution where
possible and an ML-based approximation in remaining cases. On two single-table
and nine two-table real-world EM datasets, we show that our labeling model
results in a 9% higher F1 score on average than the best existing method. We
also show that a deep learning EM end model (DeepMatcher) trained on labels
generated from our weak supervision approach is comparable to an end model
trained using tens of thousands of ground-truth labels, demonstrating that our
approach can significantly reduce the labeling efforts required in EM.
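Concretely, transitivity means that if pairs (a, b) and (b, c) are both predicted as matches, then (a, c) must be a match as well. The paper enforces this with an exact solution where possible and an ML-based approximation in the remaining cases; the union-find sketch below only illustrates the constraint itself by computing the transitive closure of a set of predicted matches, and is not the paper's algorithm.

```python
# Illustrative only: take the transitive closure of predicted matches with
# union-find, so that match(a, b) and match(b, c) imply match(a, c).
# This is not the paper's actual correction method.
from itertools import combinations

def transitive_closure(records, predicted_matches):
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in predicted_matches:
        parent[find(a)] = find(b)  # union the two components

    # Every pair inside the same component is a match after closure.
    return {(a, b) for a, b in combinations(records, 2)
            if find(a) == find(b)}

# match(1, 2) and match(2, 3) force (1, 3) into the closure:
print(transitive_closure([1, 2, 3, 4], [(1, 2), (2, 3)]))
# -> {(1, 2), (1, 3), (2, 3)}
```

Note that blindly taking the closure can promote confident non-match predictions into matches, which is why the paper resolves such conflicts with an exact solution or an ML-based approximation rather than a plain closure.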
Related papers
- Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance [21.926934384262594]
Large language models (LLMs) offer new opportunities to enhance the annotation process.
We compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency.
Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance.
arXiv Detail & Related papers (2024-10-24T16:27:03Z)
- Deep Partial Multi-Label Learning with Graph Disambiguation [27.908565535292723]
We propose a novel deep Partial multi-Label model with grAph-disambIguatioN (PLAIN).
Specifically, we introduce the instance-level and label-level similarities to recover label confidences.
At each training epoch, labels are propagated on the instance and label graphs to produce relatively accurate pseudo-labels.
arXiv Detail & Related papers (2023-05-10T04:02:08Z)
- Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision [75.1860418333995]
Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently.
The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources as labeling functions.
Existing statistical label models typically rely only on the outputs of the LFs, ignoring instance features when modeling the underlying generative process.
arXiv Detail & Related papers (2022-10-06T07:28:53Z)
- Learned Label Aggregation for Weak Supervision [8.819582879892762]
We propose a data programming approach that aggregates weak supervision signals to generate labeled data easily.
The quality of the generated labels depends on a label aggregation model that aggregates all noisy labels from all LFs to infer the ground-truth labels.
We show the model can be trained using synthetically generated data and design an effective architecture for the model.
arXiv Detail & Related papers (2022-07-27T14:36:35Z)
- Mining Multi-Label Samples from Single Positive Labels [32.10330097419565]
Conditional generative adversarial networks (cGANs) have shown superior results in class-conditional generation tasks.
To simultaneously control multiple conditions, cGANs require multi-label training datasets, where multiple labels can be assigned to each data instance.
We propose a novel sampling approach called single-to-multi-label (S2M) sampling, based on the Markov chain Monte Carlo method.
arXiv Detail & Related papers (2022-06-12T15:14:29Z)
- One Positive Label is Sufficient: Single-Positive Multi-Label Learning with Label Enhancement [71.9401831465908]
We investigate single-positive multi-label learning (SPMLL) where each example is annotated with only one relevant label.
A novel method named SMILE, i.e., Single-positive MultI-label learning with Label Enhancement, is proposed.
Experiments on benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-06-01T14:26:30Z)
- Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition [98.25592165484737]
We propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL).
CMPL achieves 17.6% and 25.1% Top-1 accuracy on Kinetics-400 and UCF-101, respectively, using only the RGB modality and 1% labeled data.
arXiv Detail & Related papers (2021-12-17T18:59:41Z)
- Group-aware Label Transfer for Domain Adaptive Person Re-identification [179.816105255584]
Unsupervised Domain Adaptation (UDA) person re-identification (ReID) aims at adapting a model trained on a labeled source-domain dataset to a target-domain dataset without any further annotations.
Most successful UDA-ReID approaches combine clustering-based pseudo-label prediction with representation learning and perform the two steps in an alternating fashion.
We propose a Group-aware Label Transfer (GLT) algorithm, which enables the online interaction and mutual promotion of pseudo-label prediction and representation learning.
arXiv Detail & Related papers (2021-03-23T07:57:39Z)
- Label Confusion Learning to Enhance Text Classification Models [3.0251266104313643]
Label Confusion Model (LCM) learns label confusion to capture semantic overlap among labels.
LCM can generate a better label distribution to replace the original one-hot label vector.
Experiments on five text classification benchmark datasets reveal the effectiveness of LCM for several widely used deep learning classification models.
arXiv Detail & Related papers (2020-12-09T11:34:35Z)
- An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels [49.036212158261215]
Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications.
Current state-of-the-art LMTC models employ Label-Wise Attention Networks (LWANs).
We show that hierarchical methods based on Probabilistic Label Trees (PLTs) outperform LWANs.
We propose a new state-of-the-art method which combines BERT with LWANs.
arXiv Detail & Related papers (2020-10-04T18:55:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.