REGen: A Reliable Evaluation Framework for Generative Event Argument Extraction
- URL: http://arxiv.org/abs/2502.16838v2
- Date: Wed, 10 Sep 2025 15:49:00 GMT
- Title: REGen: A Reliable Evaluation Framework for Generative Event Argument Extraction
- Authors: Omar Sharif, Joseph Gatto, Madhusudan Basak, Sarah M. Preum
- Abstract summary: Event argument extraction identifies arguments for predefined event roles in text. Existing work evaluates this task with exact match (EM), where predicted arguments must align exactly with annotated spans. We introduce REGen, a Reliable Evaluation framework for Generative event argument extraction.
- Score: 6.210603343412543
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Event argument extraction identifies arguments for predefined event roles in text. Existing work evaluates this task with exact match (EM), where predicted arguments must align exactly with annotated spans. While suitable for span-based models, this approach falls short for large language models (LLMs), which often generate diverse yet semantically accurate arguments. EM severely underestimates performance by disregarding valid variations. Furthermore, EM evaluation fails to capture implicit arguments (unstated but inferable) and scattered arguments (distributed across a document). These limitations underscore the need for an evaluation framework that better captures models' actual performance. To bridge this gap, we introduce REGen, a Reliable Evaluation framework for Generative event argument extraction. REGen combines the strengths of exact, relaxed, and LLM-based matching to better align with human judgment. Experiments on six datasets show that REGen reveals an average performance gain of +23.93 F1 over EM, reflecting capabilities overlooked by prior evaluation. Human validation further confirms REGen's effectiveness, achieving 87.67% alignment with human assessments of argument correctness.
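The abstract describes REGen as combining exact, relaxed, and LLM-based matching to judge argument correctness. Below is a minimal sketch of such a cascade; the Jaccard threshold and the llm_judge() helper are illustrative placeholders, not the paper's actual implementation.
```python
# Cascaded argument matcher in the spirit of REGen: try exact match first,
# then a relaxed token-overlap match, then fall back to an LLM judge.
# The threshold and llm_judge() are assumptions for illustration only.

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def relaxed_match(pred: str, gold: str, threshold: float = 0.6) -> bool:
    pred_tokens = set(pred.lower().split())
    gold_tokens = set(gold.lower().split())
    if not pred_tokens or not gold_tokens:
        return False
    # Jaccard overlap between predicted and gold argument tokens
    return len(pred_tokens & gold_tokens) / len(pred_tokens | gold_tokens) >= threshold

def llm_judge(pred: str, gold: str, context: str) -> bool:
    """Placeholder: ask an LLM whether `pred` expresses the same argument
    as `gold` given the document `context`."""
    raise NotImplementedError

def argument_correct(pred: str, gold: str, context: str) -> bool:
    if exact_match(pred, gold):
        return True
    if relaxed_match(pred, gold):
        return True
    return llm_judge(pred, gold, context)
```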
Related papers
- Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision [25.382800247901827]
DeepfakeJudge is a framework for scalable reasoning supervision and evaluation. It integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models.
arXiv Detail & Related papers (2026-02-23T11:08:46Z) - Rethinking Reward Models for Multi-Domain Test-Time Scaling [91.76069784586149]
Prior work generally assumes that process reward models (PRMs) outperform outcome reward models (ORMs) that assess only the final answer. We present the first unified evaluation of four reward model variants across 14 diverse domains. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories.
arXiv Detail & Related papers (2025-10-01T04:21:14Z) - Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation [52.3707788779464]
We introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD). ARC-JSD enables efficient and accurate identification of essential context sentences without additional fine-tuning, gradient calculation, or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs of different scales demonstrate superior accuracy and significant computational efficiency improvements.
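A rough sketch of the JSD-based attribution idea: compare the model's response distribution given the full context against the distribution when one context sentence is removed, and rank sentences by the divergence their removal causes. Obtaining the distributions from an instruction-tuned LLM is outside this sketch and is assumed.
```python
import numpy as np

def jensen_shannon_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """JSD (in bits) between two probability distributions over the same support."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def attribute_sentences(full_dist, ablated_dists):
    """Score each context sentence by the divergence caused by ablating it.

    full_dist: response distribution with the full context (numpy array).
    ablated_dists: {sentence_id: distribution with that sentence removed}.
    Returns sentence ids sorted from most to least influential.
    """
    scores = {sid: jensen_shannon_divergence(full_dist, dist)
              for sid, dist in ablated_dists.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```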
arXiv Detail & Related papers (2025-05-22T09:04:03Z) - CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization [90.15027447565427]
Chain of thought (CoT) generates free-text explanations that help guide a model's predictions. Self-Consistency (SC) marginalizes predictions over multiple generated explanations. We propose Chain-of-Keywords (CoKe) for customizable, fine-grained story evaluation via chain-of-keyword rationalization.
arXiv Detail & Related papers (2025-03-21T13:37:46Z) - Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models [69.38024658668887]
The current evaluation method for event extraction relies on token-level exact match.
We propose RAEE, an automatic evaluation framework that accurately assesses event extraction results at the semantic level instead of the token level.
arXiv Detail & Related papers (2024-10-12T07:54:01Z) - xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation [9.22621553566816]
This paper shows that optimizing the key answer extraction module improves extraction accuracy and enhances evaluation reliability. We propose xFinder, a novel evaluator for answer extraction and matching in large language model (LLM) evaluation. Generalization tests and real-world evaluations show that the smallest xFinder model, with only 500 million parameters, achieves an average extraction accuracy of 93.42%. The final judgment accuracy of xFinder reaches 97.61%, outperforming existing evaluation frameworks and judge models.
arXiv Detail & Related papers (2024-05-20T08:30:13Z) - ULTRA: Unleash LLMs' Potential for Event Argument Extraction through Hierarchical Modeling and Pair-wise Self-Refinement [6.035020544588768]
Event argument extraction (EAE) is the task of identifying role-specific text spans (i.e., arguments) for a given event.
We propose a hierarchical framework that extracts event arguments more cost-effectively.
We introduce LEAFER to address the challenge LLMs face in locating the exact boundary of an argument.
arXiv Detail & Related papers (2024-01-24T04:13:28Z) - CASA: Causality-driven Argument Sufficiency Assessment [79.13496878681309]
We propose CASA, a zero-shot causality-driven argument sufficiency assessment framework.
PS (probability of sufficiency; formalized below) measures how likely introducing the premise event would lead to the conclusion when both the premise and conclusion events are absent.
Experiments on two logical fallacy detection datasets demonstrate that CASA accurately identifies insufficient arguments.
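The PS quantity described above matches the standard probability of sufficiency from causal inference; assuming CASA uses this standard form, with $X$ the premise event and $Y$ the conclusion event, it can be written as:

$$\mathrm{PS} = P\left(Y_{X=1} = 1 \mid X = 0,\ Y = 0\right)$$

i.e., the probability that forcing the premise to hold would bring about the conclusion, given that neither the premise nor the conclusion actually occurred.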
arXiv Detail & Related papers (2024-01-10T16:21:18Z) - Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation [62.069374456021016]
We present the ArgTersely benchmark for sentence-level counter-argument generation.
We also propose Arg-LlaMA for generating high-quality counter-arguments.
arXiv Detail & Related papers (2023-12-21T06:51:34Z) - See, Say, and Segment: Teaching LMMs to Overcome False Premises [67.36381001664635]
We propose a cascading and joint training approach for LMMs to solve this task.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
arXiv Detail & Related papers (2023-12-13T18:58:04Z) - AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs)
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing coherent yet factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z) - Revisiting the Role of Similarity and Dissimilarity in Best Counter Argument Retrieval [1.7607244667735586]
We develop an efficient model for scoring counter-arguments based on similarity and dissimilarity metrics.
We propose Bipolar-encoder, a novel BERT-based model to learn an optimal representation for simultaneous similarity and dissimilarity.
Experimental results show that our proposed method achieves an accuracy@1 of 49.04%, outperforming other baselines by a large margin.
arXiv Detail & Related papers (2023-04-18T08:13:48Z) - Retrieval-Augmented Generative Question Answering for Event Argument Extraction [66.24622127143044]
We propose a retrieval-augmented generative QA model (R-GQA) for event argument extraction.
It retrieves the most similar QA pair and adds it as a prompt to the current example's context, then decodes the arguments as answers (see the sketch below).
Our approach substantially outperforms prior methods across various settings.
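A minimal sketch of the retrieve-then-prompt idea described above; the embedding pool, similarity function, and prompt template are generic stand-ins for the paper's actual retriever and QA format.
```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_most_similar(query_emb, demo_pool):
    """demo_pool: list of (embedding, question, answer) triples from training data."""
    return max(demo_pool, key=lambda d: cosine(query_emb, d[0]))

def build_prompt(context: str, question: str, query_emb, demo_pool) -> str:
    # Prepend the most similar QA pair as a demonstration, then ask the
    # argument-extraction question for the current example's context.
    _, demo_q, demo_a = retrieve_most_similar(query_emb, demo_pool)
    return (f"Question: {demo_q}\nAnswer: {demo_a}\n\n"
            f"Context: {context}\nQuestion: {question}\nAnswer:")
```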
arXiv Detail & Related papers (2022-11-14T02:00:32Z) - Aspect-Controlled Neural Argument Generation [65.91772010586605]
We train a language model for argument generation that can be controlled on a fine-grained level to generate sentence-level arguments for a given topic, stance, and aspect.
Our evaluation shows that our generation model is able to generate high-quality, aspect-specific arguments.
These arguments can be used to improve the performance of stance detection models via data augmentation and to generate counter-arguments.
arXiv Detail & Related papers (2020-04-30T20:17:22Z) - Same Side Stance Classification Task: Facilitating Argument Stance Classification by Fine-tuning a BERT Model [8.8896707993459]
The same side stance classification task provides a dataset of argument pairs classified by whether or not both arguments share the same stance.
We fine-tuned a pre-trained BERT model for three epochs and used the first 512 tokens of each argument to predict whether two arguments share the same stance.
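A sketch of that recipe using Hugging Face transformers, assuming standard pair classification; hyperparameters beyond those stated (batch size, learning rate) are illustrative defaults, and dataset construction is omitted.
```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # same side vs. different side

def encode_pair(argument_a: str, argument_b: str):
    # BERT packs the pair as [CLS] arg_a [SEP] arg_b [SEP]; only the
    # first 512 tokens are kept, as described above.
    return tokenizer(argument_a, argument_b,
                     truncation=True, max_length=512, padding="max_length")

training_args = TrainingArguments(output_dir="same-side-bert",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=16)
# train_dataset is assumed to be a Dataset of encoded pairs with a "labels"
# field; building it is omitted here.
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```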
arXiv Detail & Related papers (2020-04-23T13:54:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.