Supervised machine learning techniques for data matching based on
similarity metrics
- URL: http://arxiv.org/abs/2007.04001v2
- Date: Wed, 15 Sep 2021 12:46:07 GMT
- Title: Supervised machine learning techniques for data matching based on
similarity metrics
- Authors: Pim Verschuuren, Serena Palazzo, Tom Powell, Steve Sutton, Alfred
Pilgrim, Michele Faucci Giannelli
- Abstract summary: Data matching is the field that tries to identify instances in data that refer to the same real-world entity.
In this study, machine learning techniques are combined with string similarity functions and applied to the field of data matching.
The performance was compared with a solution from FISCAL Technologies as a benchmark against currently available deduplication solutions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Businesses, governmental bodies and NGOs have an ever-increasing amount of
data at their disposal from which they try to extract valuable information.
Often, this needs to be done not only accurately but also within a short time
frame. Clean and consistent data is therefore crucial. Data matching is the
field that tries to identify instances in data that refer to the same
real-world entity. In this study, machine learning techniques are combined with
string similarity functions and applied to the field of data matching. A dataset of
invoices from a variety of businesses and organizations was preprocessed with a
grouping scheme to reduce pair dimensionality and a set of similarity functions
was used to quantify similarity between invoice pairs. The resulting invoice
pair dataset was then used to train and validate a neural network and a boosted
decision tree. The performance was compared with a solution from FISCAL
Technologies as a benchmark against currently available deduplication
solutions. Both the neural network and the boosted decision tree performed as
well as or better than this benchmark.
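The preprocessing described in the abstract (group invoices to limit the number of pairs, then score each remaining pair with similarity functions) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the field names, the grouping rule (same amount), and the use of `difflib.SequenceMatcher` as the similarity function are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical invoice records; the field names are illustrative only.
invoices = [
    {"id": 1, "supplier": "Acme Ltd", "reference": "INV-1001", "amount": 250.00},
    {"id": 2, "supplier": "ACME Limited", "reference": "INV1001", "amount": 250.00},
    {"id": 3, "supplier": "Acme Ltd", "reference": "INV-2040", "amount": 980.50},
    {"id": 4, "supplier": "Globex", "reference": "G-77", "amount": 250.00},
]


def block_key(invoice):
    # Assumed grouping rule: only invoices with the same amount are compared.
    return round(invoice["amount"], 2)


def string_similarity(a, b):
    # Case-insensitive similarity ratio in [0, 1] from the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(records):
    # Group records by key, then pair records only within each group.
    groups = defaultdict(list)
    for record in records:
        groups[block_key(record)].append(record)
    for group in groups.values():
        yield from combinations(group, 2)


def pair_features(a, b):
    # One feature row per invoice pair; a classifier would be trained on these.
    return {
        "pair": (a["id"], b["id"]),
        "supplier_sim": string_similarity(a["supplier"], b["supplier"]),
        "reference_sim": string_similarity(a["reference"], b["reference"]),
    }


features = [pair_features(a, b) for a, b in candidate_pairs(invoices)]
for row in features:
    print(row)
```

With four invoices there are six possible pairs; the assumed grouping rule leaves three to score. Rows of this kind, together with duplicate/non-duplicate labels, are the sort of input a neural network or boosted decision tree could be trained on.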
Related papers
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.
Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.
Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
- Approaching Metaheuristic Deep Learning Combos for Automated Data Mining [0.5419570023862531]
This work proposes a means of combining meta-heuristic methods with conventional classifiers and neural networks in order to perform automated data mining.
Experiments on the MNIST dataset for handwritten digit recognition were performed.
It was empirically observed that using a ground truth labeled dataset's validation accuracy is inadequate for correcting labels of other previously unseen data instances.
arXiv Detail & Related papers (2024-10-16T10:28:22Z)
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
arXiv Detail & Related papers (2023-04-28T19:05:16Z)
- Personalized Decentralized Multi-Task Learning Over Dynamic Communication Graphs [59.96266198512243]
We propose a decentralized and federated learning algorithm for tasks that are positively and negatively correlated.
Our algorithm uses gradients to calculate the correlations among tasks automatically, and dynamically adjusts the communication graph to connect mutually beneficial tasks and isolate those that may negatively impact each other.
We conduct experiments on a synthetic Gaussian dataset and a large-scale celebrity attributes (CelebA) dataset.
arXiv Detail & Related papers (2022-12-21T18:58:24Z)
- Towards Similarity-Aware Time-Series Classification [51.2400839966489]
We study time-series classification (TSC), a fundamental task of time-series data mining.
We propose Similarity-Aware Time-Series Classification (SimTSC), a framework that models similarity information with graph neural networks (GNNs).
arXiv Detail & Related papers (2022-01-05T02:14:57Z)
- Aggregation Delayed Federated Learning [20.973999078271483]
Federated learning is a distributed machine learning paradigm where multiple data owners (clients) collaboratively train one machine learning model while keeping data on their own devices.
Studies have found performance reduction with standard federated algorithms, such as FedAvg, on non-IID data.
Many existing works on handling non-IID data adopt the same aggregation framework as FedAvg and focus on improving model updates either on the server side or on clients.
In this work, we tackle this challenge by introducing redistribution rounds that delay the aggregation. We perform experiments on multiple tasks and show that the proposed framework significantly improves the performance on non-IID
arXiv Detail & Related papers (2021-08-17T04:06:10Z)
- Category-Learning with Context-Augmented Autoencoder [63.05016513788047]
Finding an interpretable non-redundant representation of real-world data is one of the key problems in Machine Learning.
We propose a novel method of using data augmentations when training autoencoders.
We train a Variational Autoencoder in such a way that it makes the transformation outcome predictable by an auxiliary network.
arXiv Detail & Related papers (2020-10-10T14:04:44Z)
- Learning to Match Jobs with Resumes from Sparse Interaction Data using Multi-View Co-Teaching Network [83.64416937454801]
Job-resume interaction data is sparse and noisy, which affects the performance of job-resume match algorithms.
We propose a novel multi-view co-teaching network from sparse interaction data for job-resume matching.
Our model is able to outperform state-of-the-art methods for job-resume matching.
arXiv Detail & Related papers (2020-09-25T03:09:54Z)
- Data Separability for Neural Network Classifiers and the Development of a Separability Index [17.49709034278995]
We created the Distance-based Separability Index (DSI) to measure the separability of datasets.
We show that the DSI can indicate whether data belonging to different classes have similar distributions.
We also discussed possible applications of the DSI in the fields of data science, machine learning, and deep learning.
arXiv Detail & Related papers (2020-05-27T01:49:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.