TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction
- URL: http://arxiv.org/abs/2311.09562v3
- Date: Thu, 6 Jun 2024 04:24:16 GMT
- Title: TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction
- Authors: Kuan-Hao Huang, I-Hung Hsu, Tanmay Parekh, Zhiyu Xie, Zixuan Zhang, Premkumar Natarajan, Kai-Wei Chang, Nanyun Peng, Heng Ji
- Abstract summary: We present TextEE, a standardized, fair, and reproducible benchmark for event extraction.
TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains.
We evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance.
- Score: 131.7684896032888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event extraction has gained considerable interest due to its wide-ranging applications. However, recent studies draw attention to evaluation issues, suggesting that reported scores may not accurately reflect the true performance. In this work, we identify and address evaluation challenges, including inconsistency due to varying data assumptions or preprocessing steps, the insufficiency of current evaluation frameworks that may introduce dataset or data split bias, and the low reproducibility of some previous approaches. To address these challenges, we present TextEE, a standardized, fair, and reproducible benchmark for event extraction. TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains and includes 14 recent methodologies, conducting a comprehensive benchmark reevaluation. We also evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance. Inspired by our reevaluation results and findings, we discuss the role of event extraction in the current NLP era, as well as future challenges and insights derived from TextEE. We believe TextEE, the first standardized comprehensive benchmarking tool, will significantly facilitate future event extraction research.
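A benchmark like TextEE standardizes how event extraction outputs are matched and scored across datasets. As a concrete anchor, here is a minimal sketch of the trigger-classification F1 such benchmarks compute; the tuple layout and helper names are illustrative assumptions, not TextEE's actual API.

```python
# Hypothetical illustration of trigger-classification F1 as standardized
# by EE benchmarks; data layout and names are assumptions, not TextEE's API.

def f1(num_pred: int, num_gold: int, num_match: int) -> float:
    """Micro F1 from counts of predicted, gold, and matched tuples."""
    if num_pred == 0 or num_gold == 0 or num_match == 0:
        return 0.0
    precision = num_match / num_pred
    recall = num_match / num_gold
    return 2 * precision * recall / (precision + recall)

def trigger_cls_f1(preds, golds):
    """Score triggers as (start, end, event_type) tuples per instance.

    `preds` and `golds` are lists (one entry per sentence) of sets of
    such tuples; a tuple counts as correct only if the span AND the
    event type both match exactly.
    """
    num_pred = sum(len(p) for p in preds)
    num_gold = sum(len(g) for g in golds)
    num_match = sum(len(p & g) for p, g in zip(preds, golds))
    return f1(num_pred, num_gold, num_match)

# Toy usage: one sentence, one correct trigger out of two predictions.
preds = [{(3, 4, "Conflict:Attack"), (7, 8, "Movement:Transport")}]
golds = [{(3, 4, "Conflict:Attack")}]
print(f"Trigger-C F1: {trigger_cls_f1(preds, golds):.3f}")  # 0.667
```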
Related papers
- Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models [69.38024658668887]
Current evaluation methods for event extraction rely on token-level exact match.
We propose RAEE, an automatic evaluation framework that accurately assesses event extraction results at the semantic level instead of the token level (see the sketch below).
arXiv Detail & Related papers (2024-10-12T07:54:01Z)
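To make that contrast concrete, here is a minimal sketch of token-level exact match next to one possible semantic-level relaxation via sentence embeddings; the model choice and similarity threshold are illustrative assumptions, not RAEE's actual procedure.

```python
# Sketch contrasting token-level exact match with one possible
# semantic-level match; threshold and model are assumptions, not RAEE's.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def exact_match(pred: str, gold: str) -> bool:
    # Token-level exact match: any surface variation counts as wrong.
    return pred.strip().lower() == gold.strip().lower()

def semantic_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    # Embed both strings and accept if cosine similarity clears a
    # (hypothetical) threshold, so paraphrases are not penalized.
    emb = model.encode([pred, gold], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

pred, gold = "the U.S. military", "American armed forces"
print(exact_match(pred, gold))     # False: no token overlap
print(semantic_match(pred, gold))  # likely True: same entity, new surface
```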
- EBES: Easy Benchmarking for Event Sequences [17.277513178760348]
Event sequences are common data structures in various real-world domains such as healthcare, finance, and user interaction logs.
Despite advances in temporal data modeling techniques, there are no standardized benchmarks for evaluating their performance on event sequences.
We introduce EBES, a comprehensive benchmarking tool with standardized evaluation scenarios and protocols.
arXiv Detail & Related papers (2024-10-04T13:03:43Z)
- Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation [0.9332308328407303]
Estimating conditional average dose responses (CADR) is an important but challenging problem.
Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance.
We propose a novel decomposition scheme that allows the evaluation of the impact of five distinct components contributing to CADR estimator performance.
arXiv Detail & Related papers (2024-06-12T13:39:32Z)
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary (see the sketch below).
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
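The core move in such NLI-based factuality metrics can be sketched compactly: score each atomic claim for entailment against the source and aggregate. The model choice, the assumption that claims are already extracted, and the mean aggregation are all illustrative; this is not FENICE's exact pipeline.

```python
# Sketch of NLI-based claim scoring in the spirit of FENICE; the model
# and mean aggregation are assumptions, not the paper's exact pipeline.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def claim_entailment(source: str, claim: str) -> float:
    """Probability that the source entails the claim."""
    # The pipeline accepts a dict with text / text_pair for paired input.
    scores = nli({"text": source, "text_pair": claim}, top_k=None)
    return next(s["score"] for s in scores if s["label"] == "ENTAILMENT")

source = "The company reported a 12% revenue increase in Q3."
claims = [
    "Revenue grew in the third quarter.",           # supported
    "The company reported falling profits in Q3.",  # unsupported
]
per_claim = [claim_entailment(source, c) for c in claims]
factuality = sum(per_claim) / len(per_claim)  # naive mean aggregation
print(per_claim, factuality)
```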
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions, with their answers generated by PLMs, are then recomposed as evidence to obtain the evaluation result (see the sketch below).
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
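A rough sketch of that decompose-and-recompose recipe, assuming one yes/no subquestion per generated sentence and a fraction-of-yes aggregation; the prompt wording, model, and aggregation are assumptions, not DecompEval's exact setup.

```python
# Sketch of decomposed instruction-style evaluation in the spirit of
# DecompEval; prompts, model, and yes-ratio aggregation are assumptions.
from transformers import pipeline

qa = pipeline("text2text-generation", model="google/flan-t5-base")

def evaluate_coherence(context: str, generated: str) -> float:
    # Decompose the overall quality question into one subquestion per
    # sentence of the generated text, answered yes/no by the PLM.
    sentences = [s.strip() for s in generated.split(".") if s.strip()]
    answers = []
    for sent in sentences:
        prompt = (
            f"Context: {context}\n"
            f"Sentence: {sent}\n"
            "Question: Is this sentence coherent with the context? "
            "Answer yes or no."
        )
        out = qa(prompt, max_new_tokens=3)[0]["generated_text"]
        answers.append(out.strip().lower().startswith("yes"))
    # Recompose: fraction of subquestions answered "yes" as the score.
    return sum(answers) / len(answers) if answers else 0.0

print(evaluate_coherence(
    "A user asks about refund policies.",
    "Refunds are available within 30 days. The moon orbits the Earth.",
))
```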
- EvEval: A Comprehensive Evaluation of Event Semantics for Large Language Models [31.704144542866636]
Events serve as fundamental units of occurrence within various contexts.
Recent studies have begun leveraging large language models (LLMs) to address event semantic processing.
We propose an overarching framework for event semantic processing, encompassing understanding, reasoning, and prediction.
arXiv Detail & Related papers (2023-05-24T15:55:40Z)
- TeTIm-Eval: a novel curated evaluation data set for comparing text-to-image models [1.1252184947601962]
Evaluating and comparing text-to-image models is a challenging problem.
In this paper, a novel evaluation approach is tested, based on: (i) a curated data set divided into ten categories; (ii) a quantitative metric, the CLIP-score (see the sketch below); and (iii) a human evaluation task to distinguish, for a given text, the real image from the generated ones.
Early experimental results show that the accuracy of human judgement is fully consistent with the CLIP-score.
arXiv Detail & Related papers (2022-12-15T13:52:03Z)
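A minimal sketch of a CLIP-score as used in such comparisons: the cosine similarity between CLIP's text and image embeddings. The checkpoint is the standard public one; any rescaling the paper may apply is not specified here, so none is applied.

```python
# Sketch of a CLIP-score: cosine similarity between CLIP's image and
# text embeddings; the scoring convention (no rescaling) is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, text: str) -> float:
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Normalize both embeddings and take the cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

# Usage: score a generated image against its prompt.
# image = Image.open("generated.png")
# print(clip_score(image, "a watercolor painting of a lighthouse at dusk"))
```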
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets (see the sketch below).
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
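The abstract describes the discounting only at a high level; the sketch below assumes each IoU hit is down-weighted by how far the predicted boundaries drift from the ground truth, normalized by video length. That discount is an illustration, not necessarily the paper's exact definition.

```python
# Hedged sketch of a discounted recall in the spirit of "dR@n,IoU@m":
# a prediction clearing the IoU threshold is down-weighted by its
# boundary drift from the ground truth. Illustrative assumption only.

def iou(a: tuple, b: tuple) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def discounted_recall_at_1(preds, golds, durations, m: float = 0.5):
    """preds/golds: lists of (start, end) in seconds, one pair per query."""
    total = 0.0
    for (ps, pe), (gs, ge), dur in zip(preds, golds, durations):
        if iou((ps, pe), (gs, ge)) >= m:
            # Boundary-alignment discounts in [0, 1].
            alpha_s = 1.0 - abs(ps - gs) / dur
            alpha_e = 1.0 - abs(pe - ge) / dur
            total += alpha_s * alpha_e
    return total / len(golds)

# Toy usage: one good hit, one miss below the IoU threshold.
preds = [(10.0, 20.0), (0.0, 5.0)]
golds = [(11.0, 20.0), (30.0, 40.0)]
print(discounted_recall_at_1(preds, golds, durations=[60.0, 60.0]))  # ~0.492
```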
- Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark [54.782397511033345]
Crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts.
We present a new unified cross-dataset benchmark with 14 NLI datasets and re-evaluate 9 widely-used neural network-based NLI models.
Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
arXiv Detail & Related papers (2020-10-15T11:50:12Z)