GenRES: Rethinking Evaluation for Generative Relation Extraction in the
  Era of Large Language Models
- URL: http://arxiv.org/abs/2402.10744v1
- Date: Fri, 16 Feb 2024 15:01:24 GMT
- Title: GenRES: Rethinking Evaluation for Generative Relation Extraction in the
  Era of Large Language Models
- Authors: Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, Jiawei Han
- Abstract summary: We introduce GenRES for a multi-dimensional assessment of GRE results in terms of topic similarity, uniqueness, granularity, factualness, and completeness.
With GenRES, we empirically identified that precision/recall fails to justify the performance of GRE methods.
Next, we conducted a human evaluation of GRE methods that shows GenRES is consistent with human preferences for RE quality.
- Score: 48.56814147033251
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   The field of relation extraction (RE) is experiencing a notable shift towards
generative relation extraction (GRE), leveraging the capabilities of large
language models (LLMs). However, we discovered that traditional RE metrics
like precision and recall fall short in evaluating GRE methods. This shortfall
arises because these metrics rely on exact matching with human-annotated
reference relations, while GRE methods often produce diverse and semantically
accurate relations that differ from the references. To fill this gap, we
introduce GenRES for a multi-dimensional assessment of GRE results in terms of
topic similarity, uniqueness, granularity, factualness, and completeness.
With GenRES, we empirically identified that (1)
precision/recall fails to justify the performance of GRE methods; (2)
human-annotated referential relations can be incomplete; (3) prompting LLMs
with a fixed set of relations or entities can cause hallucinations. Next, we
conducted a human evaluation of GRE methods that shows GenRES is consistent
with human preferences for RE quality. Last, we conducted a comprehensive
evaluation of fourteen leading LLMs using GenRES across document-, bag-, and
sentence-level RE datasets to set the benchmark for future research in GRE.
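
To make the exact-matching shortfall concrete, the sketch below contrasts traditional precision/recall with a placeholder for a GenRES-style multi-dimensional assessment. This is a minimal illustration rather than the authors' implementation; Triple, exact_match_prf, and genres_style_scores are hypothetical names, and the dimension stubs only mark where topic similarity, uniqueness, granularity, factualness, and completeness would actually be computed.

# Minimal, self-contained sketch (not the paper's implementation) contrasting
# exact-match precision/recall with a placeholder multi-dimensional scorer in
# the spirit of GenRES. All names below are hypothetical illustrations.

from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


def exact_match_prf(predicted, reference):
    """Traditional RE scoring: a prediction counts only if it exactly
    matches a human-annotated reference triple."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def genres_style_scores(predicted, source_text):
    """Placeholder for a GenRES-style multi-dimensional assessment. Each
    dimension would be computed with its own model or statistic (e.g.,
    topic modelling, embedding similarity, fact verification against the
    source text); here the values are stubs."""
    return {
        "topic_similarity": 0.0,  # do the triples reflect the source topics?
        "uniqueness": 0.0,        # are the triples non-redundant?
        "granularity": 0.0,       # are the triples appropriately fine-grained?
        "factualness": 0.0,       # are the triples supported by the source?
        "completeness": 0.0,      # do the triples cover the source facts?
    }


# A generative extractor may phrase a correct relation differently from the
# reference, so exact matching scores it as a total miss.
reference = {Triple("Marie Curie", "awarded", "Nobel Prize in Physics")}
generated = {Triple("Marie Curie", "received", "the Nobel Prize in Physics")}
print(exact_match_prf(generated, reference))  # (0.0, 0.0, 0.0) despite a correct fact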
 
      
        Related papers
        - Benchmarking LLMs' Judgments with No Gold Standard [8.517244114791913]
We introduce GEM (Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by large language models (LLMs).
In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner.
We also present GRE-bench, which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers.
 arXiv  Detail & Related papers  (2024-11-11T16:58:36Z)
- Ground Every Sentence: Improving Retrieval-Augmented LLMs with   Interleaved Reference-Claim Generation [51.8188846284153]
Attributed Text Generation (ATG) is proposed to enhance credibility and verifiability in RAG systems.
This paper proposes ReClaim, a fine-grained ATG method that alternates the generation of references and answers step by step.
Extensive experiments verify the effectiveness of ReClaim across a range of settings, achieving a citation accuracy rate of 90%.
 arXiv  Detail & Related papers  (2024-07-01T20:47:47Z)
- Sequencing Matters: A Generate-Retrieve-Generate Model for Building
  Conversational Agents [9.191944519634111]
This paper describes the Georgetown InfoSense group's work on the challenges presented by TREC iKAT 2023.
Our submitted runs outperform the median runs by a significant margin, exhibiting superior performance in nDCG across various cut numbers and in overall success rate.
Our solution involves the use of Large Language Models (LLMs) for initial answers, answer grounding by BM25, passage quality filtering by logistic regression, and answer generation by LLMs again.
 arXiv  Detail & Related papers  (2023-11-16T02:37:58Z)
- Self-RAG: Learning to Retrieve, Generate, and Critique through
  Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG).
Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection.
It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
 arXiv  Detail & Related papers  (2023-10-17T18:18:32Z)
- Whether you can locate or not? Interactive Referring Expression
  Generation [12.148963878497243]
 We propose an Interactive REG (IREG) model that can interact with a real REC model.
IREG outperforms previous state-of-the-art methods on popular evaluation metrics.
 arXiv  Detail & Related papers  (2023-08-19T10:53:32Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited
  Reference Diversity in NLG Evaluation [55.92852268168816]
 N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
 arXiv  Detail & Related papers  (2023-08-06T14:49:26Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying   References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by increasing the number and diversity of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
 arXiv  Detail & Related papers  (2023-05-24T11:53:29Z)
- GPT-RE: In-context Learning for Relation Extraction using Large Language
  Models [43.968903620208444]
 GPT-RE bridges the gap between large language models and fully-supervised baselines in relation extraction.
We evaluate GPT-RE on four widely-used RE datasets, and observe that GPT-RE achieves improvements over existing GPT-3 baselines.
 arXiv  Detail & Related papers  (2023-05-03T13:28:08Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thought (CoT) reasoning and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin; a minimal sketch of this LLM-as-judge pattern appears after this list.
 arXiv  Detail & Related papers  (2023-03-29T12:46:54Z)
- A Hybrid Model of Classification and Generation for Spatial Relation
  Extraction [10.611528850772869]
 We first view spatial relation extraction as a generation task and propose a novel hybrid model HMCGR for this task.
 Experimental results on SpaceEval show that HMCGR outperforms the SOTA baselines significantly.
 arXiv  Detail & Related papers  (2022-08-15T01:31:44Z)
- Should We Rely on Entity Mentions for Relation Extraction? Debiasing
  Relation Extraction with Counterfactual Analysis [60.83756368501083]
 We propose the CORE (Counterfactual Analysis based Relation Extraction) debiasing method for sentence-level relation extraction.
Our CORE method is model-agnostic and debiases existing RE systems during inference without changing their training processes.
 arXiv  Detail & Related papers  (2022-05-08T05:13:54Z)
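
As referenced in the G-Eval entry above, the following is a minimal sketch of the LLM-as-judge pattern (chain-of-thought reasoning followed by a form-filling score). It is not the official G-Eval code; call_llm is a hypothetical stand-in for whatever completion API is available, and the score parsing is deliberately naive.

# Minimal sketch of an LLM-as-judge evaluator in the spirit of G-Eval
# (not the official implementation). call_llm is a hypothetical stand-in
# for a real completion API.

def call_llm(prompt):
    """Hypothetical LLM call; replace with a real completion API client."""
    raise NotImplementedError


def build_judge_prompt(source, output, dimension):
    """Chain-of-thought plus form-filling: ask the model to reason first,
    then emit a single numeric score for one quality dimension."""
    return (
        f"You will evaluate a system output for {dimension}.\n\n"
        f"Source text:\n{source}\n\n"
        f"System output:\n{output}\n\n"
        "First, reason step by step about the quality of the output.\n"
        "Then complete the form below.\n"
        f"{dimension} score (1-5):"
    )


def judge(source, output, dimensions=("coherence", "factual consistency")):
    scores = {}
    for dim in dimensions:
        reply = call_llm(build_judge_prompt(source, output, dim))
        # Naive form parsing: take the last digit in the reply. A real
        # evaluator would parse more robustly or average sampled replies.
        digits = [c for c in reply if c.isdigit()]
        scores[dim] = int(digits[-1]) if digits else None
    return scores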