A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based
Matching Algorithms
- URL: http://arxiv.org/abs/2307.01231v2
- Date: Mon, 13 Nov 2023 04:49:28 GMT
- Title: A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based
Matching Algorithms
- Authors: George Papadakis, Nishadi Kirielle, Peter Christen, Themis Palpanas
- Abstract summary: We propose four approaches to assessing the difficulty and appropriateness of 13 established datasets.
We show that most of the popular datasets pose rather easy classification tasks.
We propose a new methodology for yielding benchmark datasets.
- Score: 11.264467955516706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Entity resolution (ER) is the process of identifying records that refer to
the same entities within one or across multiple databases. Numerous techniques
have been developed to tackle ER challenges over the years, with recent
emphasis placed on machine and deep learning methods for the matching phase.
However, the quality of the benchmark datasets typically used in the
experimental evaluations of learning-based matching algorithms has not been
examined in the literature. To cover this gap, we propose four different
approaches to assessing the difficulty and appropriateness of 13 established
datasets: two theoretical approaches, which involve new measures of linearity
and existing measures of complexity, and two practical approaches, namely the
difference between the best non-linear and linear matchers, and the
difference between the best learning-based matcher and a perfect oracle. Our
analysis demonstrates that most of the popular datasets pose rather easy
classification tasks. As a result, they are not suitable for properly
evaluating learning-based matching algorithms. To address this issue, we
propose a new methodology for yielding benchmark datasets. We put it into
practice by creating four new matching tasks, and we verify that these new
benchmarks are more challenging and therefore more suitable for further
advancements in the field.
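To make the practical measures concrete, the following is a minimal sketch of the non-linear-vs-linear matcher gap, assuming candidate record pairs have already been encoded as feature vectors (e.g., per-attribute similarity scores); the two models shown are illustrative stand-ins, not the exact matchers benchmarked in the paper.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def matcher_gap(X, y, seed=42):
    """Return F1(non-linear) - F1(linear) for a matching task.

    A small gap suggests the dataset is (near-)linearly separable and
    therefore an easy benchmark for learning-based matchers.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    nonlinear = RandomForestClassifier(
        n_estimators=200, random_state=seed).fit(X_tr, y_tr)
    return (f1_score(y_te, nonlinear.predict(X_te))
            - f1_score(y_te, linear.predict(X_te)))

# Toy usage: columns stand in for per-attribute similarities of a pair.
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # hypothetical match labels
print(f"non-linear vs. linear F1 gap: {matcher_gap(X, y):+.3f}")
```
The oracle-based measure works analogously: compare the best learning-based matcher's F1 against the perfect oracle's score of 1.0.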
Related papers
- Evaluating LLMs on Entity Disambiguation in Tables [0.9786690381850356]
This work proposes an extensive evaluation of four state-of-the-art (SOTA) Semantic Table Interpretation (STI) approaches: Alligator (formerly s-elbat), Dagobah, TURL, and TableLlama.
We also include in the evaluation both GPT-4o and GPT-4o-mini, since they excel in various public benchmarks.
arXiv Detail & Related papers (2024-08-12T18:01:50Z)
- Multivariate Time Series Anomaly Detection: Fancy Algorithms and Flawed Evaluation Methodology [2.043517674271996]
We discuss how a normally good protocol may have weaknesses in the context of MVTS anomaly detection.
We propose a simple, yet challenging, baseline based on Principal Components Analysis (PCA) that surprisingly outperforms many recent Deep Learning (DL) based approaches on popular benchmark datasets.
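As a rough illustration of such a baseline (the exact preprocessing and thresholding in the paper may differ), a PCA reconstruction-error detector can be sketched as follows:
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_anomaly_scores(train, test, n_components=0.9):
    """Score multivariate time series points by PCA reconstruction error.

    Fits on training data assumed to be mostly normal; higher scores
    indicate more anomalous test points.
    """
    scaler = StandardScaler().fit(train)
    pca = PCA(n_components=n_components).fit(scaler.transform(train))
    z = scaler.transform(test)
    recon = pca.inverse_transform(pca.transform(z))
    return np.linalg.norm(z - recon, axis=1)

# Usage: flag, e.g., the top 1% of scores as anomalies.
```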
arXiv Detail & Related papers (2023-08-24T20:24:12Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
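A small sketch of the pairwise ordering error behind those numbers (function and argument names are hypothetical; self-reported expertise provides the ground-truth order):
```python
def pairwise_error_rate(pred_scores, true_scores):
    """Fraction of paper pairs whose predicted order contradicts the
    ground-truth (self-reported) order for one reviewer; ties in the
    ground truth are skipped because they define no order."""
    errors, total = 0, 0
    n = len(true_scores)
    for i in range(n):
        for j in range(i + 1, n):
            if true_scores[i] == true_scores[j]:
                continue
            total += 1
            if ((pred_scores[i] - pred_scores[j])
                    * (true_scores[i] - true_scores[j]) < 0):
                errors += 1
    return errors / total if total else 0.0
```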
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Towards Diverse Evaluation of Class Incremental Learning: A Representation Learning Perspective [67.45111837188685]
Class incremental learning (CIL) algorithms aim to continually learn new object classes from incrementally arriving data.
We experimentally analyze neural network models trained by CIL algorithms using various evaluation protocols in representation learning.
arXiv Detail & Related papers (2022-06-16T11:44:11Z)
- Bi-level Alignment for Cross-Domain Crowd Counting [113.78303285148041]
Current methods rely on external data for training an auxiliary task or apply an expensive coarse-to-fine estimation.
We develop a new adversarial learning based method, which is simple and efficient to apply.
We evaluate our approach on five real-world crowd counting benchmarks, where we outperform existing approaches by a large margin.
arXiv Detail & Related papers (2022-05-12T02:23:25Z)
- Multiple-criteria Based Active Learning with Fixed-size Determinantal Point Processes [43.71112693633952]
We introduce a multiple-criteria based active learning algorithm, which incorporates three complementary criteria, i.e., informativeness, representativeness and diversity.
We show that our method performs significantly better and is more stable than other multiple-criteria based AL algorithms.
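The paper selects batches via fixed-size determinantal point processes; the greedy sketch below only illustrates how the three criteria could be combined (in practice each term would be normalized to a common scale):
```python
import numpy as np

def select_batch(probs, X_pool, batch_size=10):
    """Greedily pick a batch scoring informativeness (predictive entropy),
    representativeness (closeness to the pool centroid), and diversity
    (distance to already-chosen points)."""
    info = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    rep = -np.linalg.norm(X_pool - X_pool.mean(axis=0), axis=1)
    chosen = []
    for _ in range(batch_size):
        if chosen:
            div = np.min(np.linalg.norm(
                X_pool[:, None, :] - X_pool[chosen][None, :, :],
                axis=2), axis=1)
        else:
            div = np.zeros(len(X_pool))
        score = info + rep + div
        score[chosen] = -np.inf  # never pick the same point twice
        chosen.append(int(np.argmax(score)))
    return chosen
```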
arXiv Detail & Related papers (2021-07-04T13:22:54Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based AL strategies are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z)
- Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve the Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z)
- ALdataset: a benchmark for pool-based active learning [1.9308522511657449]
Active learning (AL) is a subfield of machine learning (ML) in which a learning algorithm can achieve good accuracy with fewer training samples by interactively querying a user/oracle to label new data points.
Pool-based AL is well-motivated in many ML tasks where unlabeled data is abundant but labels are hard to obtain.
We present experiment results for various active learning strategies, both recently proposed and classic highly-cited methods, and draw insights from the results.
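For reference, a minimal pool-based AL loop with uncertainty (margin) sampling, one of the classic strategies such a benchmark covers; the model and query strategy here are illustrative, and the initial random sample is assumed to contain both classes:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_pool, y_oracle, n_init=10, n_queries=50, seed=0):
    """Binary-classification sketch; y_oracle stands in for the human."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), size=n_init, replace=False))
    for _ in range(n_queries):
        clf = LogisticRegression(max_iter=1000).fit(
            X_pool[labeled], y_oracle[labeled])
        probs = clf.predict_proba(X_pool)
        margin = np.abs(probs[:, 0] - probs[:, 1])  # small = uncertain
        margin[labeled] = np.inf         # never re-query a labeled point
        labeled.append(int(np.argmin(margin)))  # "ask the oracle"
    clf = LogisticRegression(max_iter=1000).fit(
        X_pool[labeled], y_oracle[labeled])
    return clf, labeled
```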
arXiv Detail & Related papers (2020-10-16T04:37:29Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
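One simple instance of the post-hoc idea, sketched below: equalize per-group selection rates via additive score shifts. The paper's framework balances ranking fairness and utility more carefully; this is only an illustration.
```python
import numpy as np

def group_shifts(scores, groups, top_frac=0.2):
    """Additive per-group shifts aligning each group's (1 - top_frac)
    score quantile with the global one, so a single threshold selects
    the same fraction of every protected group."""
    global_cut = np.quantile(scores, 1 - top_frac)
    return {g: global_cut - np.quantile(scores[groups == g], 1 - top_frac)
            for g in np.unique(groups)}

def adjust(scores, groups, shifts):
    """Apply the shifts to produce post-processed ranking scores."""
    return scores + np.array([shifts[g] for g in groups])
```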
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- Fase-AL -- Adaptation of Fast Adaptive Stacking of Ensembles for Supporting Active Learning [0.0]
This work presents the FASE-AL algorithm, which induces classification models from unlabeled instances using Active Learning.
The algorithm achieves promising results in terms of the percentage of correctly classified instances.
arXiv Detail & Related papers (2020-01-30T17:25:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.