Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models
- URL: http://arxiv.org/abs/2410.09418v1
- Date: Sat, 12 Oct 2024 07:54:01 GMT
- Title: Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models
- Authors: Yi-Fan Lu, Xian-Ling Mao, Tian Lan, Chen Xu, Heyan Huang
- Abstract summary: The current evaluation method for event extraction relies on token-level exact match.
We propose RAEE, an automatic evaluation framework that accurately assesses event extraction results at the semantic level instead of the token level.
- Score: 69.38024658668887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Event extraction has gained extensive research attention due to its broad range of applications. However, the current mainstream evaluation method for event extraction relies on token-level exact match, which misjudges many cases that are correct at the semantic level. This reliance leads to a significant discrepancy between the evaluated performance of models under exact match criteria and their real performance. To address this problem, we propose RAEE, an automatic evaluation framework that accurately assesses event extraction results at the semantic level instead of the token level. Specifically, RAEE leverages Large Language Models (LLMs) as automatic evaluation agents, incorporating chain-of-thought prompting and an adaptive mechanism to achieve interpretable and adaptive evaluations of the precision and recall of triggers and arguments. Extensive experimental results demonstrate that: (1) RAEE achieves a very high correlation with average human judgments; (2) after reassessing 14 models, including advanced LLMs, on 10 datasets, there is a significant performance gap between exact match and RAEE. The exact match evaluation significantly underestimates the performance of existing event extraction models, particularly the capabilities of LLMs; (3) fine-grained analysis under RAEE evaluation reveals insightful phenomena worth further exploration. The evaluation toolkit of our proposed RAEE will be publicly released.
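The gap between token-level exact match and semantic-level evaluation that the abstract describes can be illustrated with a minimal sketch. The `semantic_match` stand-in below is a hypothetical heuristic for illustration only; RAEE's actual judge is an LLM agent with chain-of-thought prompting, as defined in the paper and its toolkit.

```python
# Contrast token-level exact match with a semantic-level judgment for an
# extracted event trigger. The semantic judge here is a toy stand-in for an
# LLM-based evaluator; its names and logic are illustrative, not RAEE's
# actual implementation.

def exact_match(pred: str, gold: str) -> bool:
    """Token-level exact match: the mainstream evaluation criterion."""
    return pred.strip().lower() == gold.strip().lower()

def semantic_match(pred: str, gold: str) -> bool:
    """Toy stand-in for an LLM judge: accept when one span's tokens are a
    superset of the other's (e.g. 'was elected' vs. 'elected'). A real
    system would instead prompt an LLM to reason about event semantics."""
    p = set(pred.strip().lower().split())
    g = set(gold.strip().lower().split())
    return bool(p & g) and (p <= g or g <= p)

gold_trigger = "elected"
pred_trigger = "was elected"

print(exact_match(pred_trigger, gold_trigger))     # False: penalized by exact match
print(semantic_match(pred_trigger, gold_trigger))  # True: same event semantics
```

Under exact match this prediction counts as an error, depressing measured precision and recall; a semantic-level evaluator credits it, which is the source of the performance gap the paper reports.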
Related papers
- Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs.
We show that GED outperforms baseline methods in model ranking, response selection, and model alignment tasks.
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
- CAS: Confidence Assessments of classification algorithms for Semantic segmentation of EO data [0.0]
Confidence assessments of semantic segmentation algorithms in remote sensing are important.
We develop a model that performs confidence evaluations at the segment and pixel levels, and outputs both labels and confidence.
The main application is the evaluation of EO Foundation Models on semantic segmentation downstream tasks.
arXiv Detail & Related papers (2024-06-26T12:05:49Z)
- Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation [0.9332308328407303]
Estimating conditional average dose responses (CADR) is an important but challenging problem.
Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance.
We propose a novel decomposition scheme that allows the evaluation of the impact of five distinct components contributing to CADR estimator performance.
arXiv Detail & Related papers (2024-06-12T13:39:32Z)
- RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference [20.250550771195726]
This paper proposes RAEE, a training-free Retrieval-Augmented Early Exiting framework for efficient inference.
Experimental results demonstrate that the proposed RAEE can significantly accelerate inference.
RAEE also achieves state-of-the-art zero-shot performance on 8 classification tasks.
arXiv Detail & Related papers (2024-05-24T04:01:24Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Leveraging Uncertainty Estimates To Improve Classifier Performance [4.4951754159063295]
Binary classification involves predicting the label of an instance based on whether the model score for the positive class exceeds a threshold chosen based on the application requirements.
However, model scores are often not aligned with the true positivity rate.
This is especially true when the training involves a differential sampling across classes or there is distributional drift between train and test settings.
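The thresholding setup this related paper studies can be sketched with a toy example (hypothetical scores and threshold values, not the paper's method): when raw model scores are miscalibrated, a fixed 0.5 threshold misclassifies, while a threshold chosen from held-out data recovers the correct split.

```python
# Toy illustration of threshold-based binary classification with
# miscalibrated scores. All values are made up for illustration.

def predict(scores, threshold=0.5):
    """Label an instance positive when its score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]

# An overconfident model: scores cluster near 1.0 even for true negatives,
# so they do not reflect the true positivity rate.
scores = [0.95, 0.90, 0.85, 0.80, 0.75]
labels = [1,    1,    0,    0,    0]

preds = predict(scores, threshold=0.5)
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds)     # [1, 1, 1, 1, 1]
print(accuracy)  # 0.4

# Choosing the threshold from the score distribution (e.g. on a validation
# set) instead of a fixed 0.5 recovers the correct split here.
preds_tuned = predict(scores, threshold=0.87)
accuracy_tuned = sum(p == y for p, y in zip(preds_tuned, labels)) / len(labels)
print(accuracy_tuned)  # 1.0
```

This is the failure mode the blurb names: the score ordering is informative, but the scores' absolute values are not aligned with true positivity, so the application-driven threshold must account for calibration.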
arXiv Detail & Related papers (2023-11-20T12:40:25Z)
- TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction [131.7684896032888]
We present TextEE, a standardized, fair, and reproducible benchmark for event extraction.
TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains.
We evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance.
arXiv Detail & Related papers (2023-11-16T04:43:03Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground truth oracle can be trained and used in its place during model validation to measure the mean reward over generated candidates, this evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.