Combining Data-driven Supervision with Human-in-the-loop Feedback for
Entity Resolution
- URL: http://arxiv.org/abs/2111.10497v1
- Date: Sat, 20 Nov 2021 02:22:12 GMT
- Title: Combining Data-driven Supervision with Human-in-the-loop Feedback for
Entity Resolution
- Authors: Wenpeng Yin, Shelby Heinecke, Jia Li, Nitish Shirish Keskar, Michael
Jones, Shouzhong Shi, Stanislav Georgiev, Kurt Milich, Joseph Esposito,
Caiming Xiong
- Abstract summary: We build a model that identifies and consolidates data points that represent the same person.
In this case study, we discuss our human-in-the-loop enabled, data-centric solution to closing the training-production performance divergence.
- Score: 47.90125404360125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The distribution gap between training datasets and data encountered in
production is well acknowledged. Training datasets are often constructed over a
fixed period of time and by carefully curating the data to be labeled. Thus,
training datasets may not contain all possible variations of data that could be
encountered in real-world production environments. Tasked with building an
entity resolution system - a model that identifies and consolidates data points
that represent the same person - our first model exhibited a clear
training-production performance gap. In this case study, we discuss our
human-in-the-loop enabled, data-centric solution to closing the
training-production performance divergence. We conclude with takeaways that
apply to data-centric learning at large.
Related papers
- Federated Impression for Learning with Distributed Heterogeneous Data [19.50235109938016]
Federated learning (FL) provides a paradigm that can learn from distributed datasets across clients without requiring them to share data.
In FL, sub-optimal convergence is common among data from different health centers due to the variety in data collection protocols and patient demographics across centers.
We propose FedImpres which alleviates catastrophic forgetting by restoring synthetic data that represents the global information as federated impression.
arXiv Detail & Related papers (2024-09-11T15:37:52Z) - The Data Addition Dilemma [4.869513274920574]
In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources.
But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings?
We identify this situation as the textitData Addition Dilemma, demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance.
arXiv Detail & Related papers (2024-08-08T01:42:31Z) - Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z) - Training Data Attribution for Diffusion Models [1.1733780065300188]
We propose a novel solution that reveals how training data influence the output of diffusion models through the use of ensembles.
In our approach individual models in an encoded ensemble are trained on carefully engineered splits of the overall training data to permit the identification of influential training examples.
The resulting model ensembles enable efficient ablation of training data influence, allowing us to assess the impact of training data on model outputs.
arXiv Detail & Related papers (2023-06-03T18:36:12Z) - Integrating Local Real Data with Global Gradient Prototypes for
Classifier Re-Balancing in Federated Long-Tailed Learning [60.41501515192088]
Federated Learning (FL) has become a popular distributed learning paradigm that involves multiple clients training a global model collaboratively.
The data samples usually follow a long-tailed distribution in the real world, and FL on the decentralized and long-tailed data yields a poorly-behaved global model.
In this work, we integrate the local real data with the global gradient prototypes to form the local balanced datasets.
arXiv Detail & Related papers (2023-01-25T03:18:10Z) - Striving for data-model efficiency: Identifying data externalities on
group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z) - Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z) - Exploring the Efficacy of Automatically Generated Counterfactuals for
Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several different datasets and using a variety of state-of-the-art benchmarks demonstrate how our approach can achieve significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z) - Federated Visual Classification with Real-World Data Distribution [9.564468846277366]
We characterize the effect real-world data distributions have on distributed learning, using as a benchmark the standard Federated Averaging (FedAvg) algorithm.
We introduce two new large-scale datasets for species and landmark classification, with realistic per-user data splits.
We also develop two new algorithms (FedVC, FedIR) that intelligently resample and reweight over the client pool, bringing large improvements in accuracy and stability in training.
arXiv Detail & Related papers (2020-03-18T07:55:49Z) - DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a
Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.