Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
- URL: http://arxiv.org/abs/2411.05224v1
- Date: Thu, 07 Nov 2024 22:36:19 GMT
- Title: Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
- Authors: Varvara Arzt, Allan Hanbury
- Abstract summary: This paper investigates transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP.
Existing relation extraction benchmarks often suffer from insufficient documentation and lack crucial details.
While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well.
- Score: 5.632231145349045
- License:
- Abstract: This paper investigates the transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP, with a focus on the relation extraction (RE) task. Existing RE benchmarks often suffer from insufficient documentation, lacking crucial details such as data sources, inter-annotator agreement, the algorithms used for the selection of instances for datasets, and information on potential biases like dataset imbalance. Progress in RE is frequently measured by leaderboards that rank systems based on evaluation methods, typically limited to aggregate metrics like F1-score. However, the absence of detailed performance analysis beyond these metrics can obscure the true generalisation capabilities of models. Our analysis reveals that widely used RE benchmarks, such as TACRED and NYT, tend to be highly imbalanced and contain noisy labels. Moreover, the lack of class-based performance metrics fails to accurately reflect model performance across datasets with a large number of relation types. These limitations should be carefully considered when reporting progress in RE. While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well. Rather than undermining the significance and value of existing RE benchmarks and the development of new models, this paper advocates for improved documentation and more rigorous evaluation to advance the field.
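The abstract's point about aggregate metrics hiding class-level failures can be illustrated with a small, hypothetical example (synthetic labels, scikit-learn metrics; not taken from the paper): on an imbalanced label set, micro-averaged F1 can look strong while macro-averaged, per-class F1 exposes poor performance on rare relation types.

```python
# Hypothetical illustration: aggregate vs. class-based F1 on an imbalanced
# relation label set (labels are synthetic, not drawn from TACRED or NYT).
from sklearn.metrics import classification_report, f1_score

# 90% of instances carry the frequent relation "per:employee_of";
# the rare relation "org:founded_by" is almost never predicted correctly.
y_true = ["per:employee_of"] * 90 + ["org:founded_by"] * 10
y_pred = ["per:employee_of"] * 99 + ["org:founded_by"] * 1

print("micro F1:", f1_score(y_true, y_pred, average="micro"))  # ~0.91, looks strong
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # ~0.57, exposes the rare class
print(classification_report(y_true, y_pred, zero_division=0))  # per-relation breakdown
```

A leaderboard reporting only the first number would rank this system highly even though it recovers just one in ten instances of the rare relation.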
Related papers
- Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts [0.6282171844772422]
Training data for many Large Language Models (LLMs) is contaminated with test data.
Public benchmark scores do not always accurately assess model properties.
arXiv Detail & Related papers (2024-10-11T20:46:56Z)
- Measuring Orthogonality in Representations of Generative Models [81.13466637365553]
In unsupervised representation learning, models aim to distill essential features from high-dimensional data into lower-dimensional learned representations.
Disentanglement of independent generative processes has long been credited with producing high-quality representations.
We propose two novel metrics: Importance-Weighted Orthogonality (IWO) and Importance-Weighted Rank (IWR).
arXiv Detail & Related papers (2024-07-04T08:21:54Z)
- OCDB: Revisiting Causal Discovery with a Comprehensive Benchmark and Evaluation Framework [21.87740178652843]
Causal discovery offers a promising approach to improve transparency and reliability.
We propose a flexible evaluation framework with metrics for evaluating differences in causal structures and causal effects.
We introduce the Open Causal Discovery Benchmark (OCDB), based on real data, to promote fair comparisons and drive optimization of algorithms.
arXiv Detail & Related papers (2024-06-07T03:09:22Z)
- Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on benchmarks (a minimal n-gram-overlap sketch appears after this list).
We reveal substantial instances of training-set and even test-set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
arXiv Detail & Related papers (2024-04-29T16:05:36Z)
- Evaluating Retrieval Quality in Retrieval-Augmented Generation [21.115495457454365]
Traditional end-to-end evaluation methods are computationally expensive.
We propose eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system.
eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
arXiv Detail & Related papers (2024-04-21T21:22:28Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed and Low Tolerance [0.562479170374811]
Per-IMage Overlap (PIMO) is a novel metric that addresses the shortcomings of AUROC and AUPRO.
Measuring recall per image simplifies computation and is more robust to noisy annotations.
Our experiments demonstrate that PIMO offers practical advantages and nuanced performance insights.
arXiv Detail & Related papers (2024-01-03T21:24:44Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks that measure the capability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- Does Recommend-Revise Produce Reliable Annotations? An Analysis on Missing Instances in DocRED [60.39125850987604]
We show that the recommend-revise scheme results in false negative samples and an obvious bias towards popular entities and relations.
The relabeled dataset is released to serve as a more reliable test set for document-level RE models.
arXiv Detail & Related papers (2022-04-17T11:29:01Z)
- Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
arXiv Detail & Related papers (2021-10-12T17:58:59Z)
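As flagged in the Benchmarking Benchmark Leakage entry above, the sketch below shows a minimal n-gram-overlap check that can serve as a rough leakage signal between a benchmark instance and a training corpus. It is a simplified, hypothetical stand-in, not the Perplexity / N-gram accuracy pipeline from that paper; the function names and example strings are invented for illustration.

```python
# Hypothetical sketch: flag possible benchmark leakage via n-gram overlap.
def ngrams(tokens, n=5):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_text: str, corpus_text: str, n: int = 5) -> float:
    """Fraction of the benchmark text's n-grams that also occur in the corpus."""
    bench = ngrams(benchmark_text.lower().split(), n)
    corp = ngrams(corpus_text.lower().split(), n)
    return len(bench & corp) / max(len(bench), 1)

# Toy usage: a high ratio suggests the test sentence may have been seen in training.
test_sentence = "Steve Jobs co-founded Apple in 1976 together with Steve Wozniak ."
training_dump = "... Steve Jobs co-founded Apple in 1976 together with Steve Wozniak ..."
print(f"5-gram overlap: {overlap_ratio(test_sentence, training_dump):.2f}")
```

In practice such overlaps would be computed at corpus scale and combined with model-based signals such as perplexity rather than applied to single sentences.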