tieval: An Evaluation Framework for Temporal Information Extraction Systems
- URL: http://arxiv.org/abs/2301.04643v3
- Date: Fri, 24 Nov 2023 16:13:18 GMT
- Title: tieval: An Evaluation Framework for Temporal Information Extraction Systems
- Authors: Hugo Sousa, Alípio Jorge, Ricardo Campos
- Abstract summary: Temporal information extraction has attracted a great deal of interest over the last two decades.
The large volume of available corpora, however, makes it difficult to benchmark TIE systems.
tieval is a Python library that provides a concise interface for importing different corpora and facilitates system evaluation.
- Score: 2.3035364984111495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal information extraction (TIE) has attracted a great deal of interest
over the last two decades, leading to the development of a significant number
of datasets. Despite its benefits, this large volume of corpora makes it
difficult to benchmark TIE systems. On the one hand, different datasets follow
different annotation schemes, hindering the comparison of competitors across
corpora. On the other hand, each corpus is commonly disseminated in a different
format, so a researcher or practitioner must invest considerable engineering
effort to develop parsers for all of them. This constraint forces researchers
to evaluate their systems on a limited number of datasets, which in turn limits
the comparability of the systems. Yet another obstacle to the comparability of
TIE systems is the evaluation metric employed. While most research works adopt
traditional metrics such as precision, recall, and $F_1$, a few others prefer
temporal awareness -- a metric tailored to be more comprehensive in the
evaluation of temporal systems. Although the reason for the absence of temporal
awareness from most evaluations is not clear, one factor that certainly weighs
on this decision is the need to implement the temporal closure algorithm in
order to compute temporal awareness, which is neither straightforward to
implement nor readily available. All in all, these problems have limited the
fair comparison between approaches and, consequently, the development of
temporal extraction systems. To mitigate these problems, we have developed
tieval, a Python library that provides a concise interface for importing
different corpora and facilitates system evaluation. In this paper, we present
the first public release of tieval and highlight its most relevant features.
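As an aside on the temporal awareness metric discussed in the abstract: it scores a system against the temporal closure of the annotations rather than against the literal relation sets. The following is a minimal, self-contained sketch of that idea, restricted to transitive BEFORE relations; a full closure reasons over all interval relation types, and the original metric (UzZaman and Allen) normalizes by reduced relation sets rather than the raw ones used here. All function names are ours for illustration; this is not tieval's implementation.

```python
from itertools import product

def temporal_closure(relations: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Transitive closure of a set of BEFORE relations.

    Each pair (a, b) means "a BEFORE b". The full temporal closure behind
    temporal awareness reasons over all interval relations; this sketch
    covers only the transitive BEFORE case to keep the idea visible.
    """
    closure = set(relations)
    while True:
        derived = {(a, d)
                   for (a, b), (c, d) in product(closure, repeat=2)
                   if b == c and (a, d) not in closure}
        if not derived:
            return closure
        closure |= derived

def temporal_awareness_f1(system: set, reference: set) -> float:
    """Awareness-style F1: a relation counts as correct if it is
    verifiable in the closure of the other annotation set.

    Simplification: the original metric divides by *reduced* relation
    sets; for brevity this toy version divides by the raw sets.
    """
    sys_closure = temporal_closure(system)
    ref_closure = temporal_closure(reference)
    precision = len(system & ref_closure) / len(system) if system else 0.0
    recall = len(reference & sys_closure) / len(reference) if reference else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# The system never asserts A BEFORE C explicitly, yet its closure entails
# it (A<B and B<C), so it is credited for the reference relation A<C.
system = {("A", "B"), ("B", "C")}
reference = {("A", "B"), ("A", "C")}
print(round(temporal_awareness_f1(system, reference), 3))  # 0.667
```

In this example, plain precision/recall over the literal relation sets would penalize the system for "missing" A BEFORE C even though that relation is logically entailed by its output; that leniency is exactly what temporal awareness adds, at the cost of implementing the closure.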
Related papers
- TSI-Bench: Benchmarking Time Series Imputation [52.27004336123575] (arXiv 2024-06-18T16:07:33Z)
TSI-Bench is a comprehensive benchmark suite for time series imputation utilizing deep learning techniques.
The TSI-Bench pipeline standardizes experimental settings to enable fair evaluation of imputation algorithms.
TSI-Bench also provides a systematic paradigm for tailoring time series forecasting algorithms to the imputation task.
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636] (arXiv 2024-06-05T13:40:07Z)
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
- Deconstructing Self-Supervised Monocular Reconstruction: The Design Decisions that Matter [63.5550818034739] (arXiv 2022-08-02T14:38:53Z)
This paper presents a framework to evaluate state-of-the-art contributions to self-supervised monocular depth estimation.
It includes pretraining, backbone, architectural design choices and loss functions.
We re-implement, validate and re-evaluate 16 state-of-the-art contributions and introduce a new dataset.
- A novel evaluation methodology for supervised Feature Ranking algorithms [0.0] (arXiv 2022-07-09T12:00:36Z)
This paper proposes a new evaluation methodology for Feature Rankers.
By making use of synthetic datasets, feature importance scores can be known beforehand, allowing more systematic evaluation.
To facilitate large-scale experimentation using the new methodology, a benchmarking framework was built in Python, called fseval.
- Are We There Yet? A Decision Framework for Replacing Term Based Retrieval with Dense Retrieval Systems [35.77217529138364] (arXiv 2022-06-26T23:16:05Z)
Several dense retrieval (DR) models have demonstrated competitive performance to term-based retrieval.
DR projects queries and documents into a dense vector space and retrieves results via (approximate) nearest neighbor search.
It is impossible to predict whether DR will become ubiquitous, but one way this could happen is through repeated application of decision processes like the one proposed.
- SAITS: Self-Attention-based Imputation for Time Series [6.321652307514677] (arXiv 2022-02-17T08:40:42Z)
SAITS is a novel method based on the self-attention mechanism for missing value imputation in time series.
It learns missing values from a weighted combination of two diagonally-masked self-attention blocks.
Tests show that SAITS efficiently outperforms state-of-the-art methods on the time-series imputation task.
- What are the best systems? New perspectives on NLP Benchmarking [10.27421161397197] (arXiv 2022-02-08T11:44:20Z)
We propose a new procedure to rank systems based on their performance across different tasks.
Motivated by the social choice theory, the final system ordering is obtained through aggregating the rankings induced by each task.
We show that our method yields different conclusions about state-of-the-art systems than the mean-aggregation procedure (a toy rank-aggregation sketch follows this list).
- Benchmarking Quality-Dependent and Cost-Sensitive Score-Level Multimodal Biometric Fusion Algorithms [58.156733807470395] (arXiv 2021-11-17T13:39:48Z)
This paper reports a benchmarking study carried out within the framework of the BioSecure DS2 (Access Control) evaluation campaign.
The campaign targeted the application of physical access control in a medium-size establishment with some 500 persons.
To the best of our knowledge, this is the first attempt to benchmark quality-based multimodal fusion algorithms.
- Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora [63.429307282665704] (arXiv 2020-11-24T17:45:03Z)
Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents.
CDCR aims to benefit downstream multi-document applications, but improvements from applying CDCR have not been shown yet.
We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus.
- Measuring the Complexity of Domains Used to Evaluate AI Systems [0.48951183832371004] (arXiv 2020-09-18T21:53:07Z)
We propose a theory for measuring the complexity of varied domains.
An application of this measure is then demonstrated to show its effectiveness in varied situations.
We propose that such a complexity metric be used in the future to compute an AI system's intelligence.
- Bias in Multimodal AI: Testbed for Fair Automatic Recruitment [73.85525896663371] (arXiv 2020-04-15T15:58:05Z)
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
We train automatic recruitment algorithms using a set of multimodal synthetic profiles consciously scored with gender and racial biases.
Our methodology and results show how to generate fairer AI-based tools in general, and in particular fairer automated recruitment systems.
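As a toy companion to the rank-aggregation entry above ("What are the best systems?"): one classical social-choice aggregator is the Borda count, sketched below. This illustrative example is ours; it conveys the flavor of aggregating per-task rankings into a single ordering but is not claimed to be the exact procedure that paper proposes.

```python
from collections import defaultdict

def borda_aggregate(task_rankings: dict[str, list[str]]) -> list[str]:
    """Aggregate per-task rankings into one ordering via Borda count.

    `task_rankings` maps each task to a list of systems, best first.
    Each system earns (n - position) points per task; ties in total
    points are broken alphabetically for determinism.
    """
    points = defaultdict(int)
    for ranking in task_rankings.values():
        n = len(ranking)
        for position, system in enumerate(ranking):
            points[system] += n - position
    return sorted(points, key=lambda s: (-points[s], s))

# Hypothetical per-task rankings of three systems on two tasks.
rankings = {
    "task_a": ["sys1", "sys2", "sys3"],
    "task_b": ["sys2", "sys3", "sys1"],
}
print(borda_aggregate(rankings))  # ['sys2', 'sys1', 'sys3']
```

Note how sys2 wins overall despite not topping task_a: rank aggregation rewards consistent placement across tasks, which is exactly where it can diverge from averaging raw scores.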
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.