GenRES: Rethinking Evaluation for Generative Relation Extraction in the
  Era of Large Language Models
- URL: http://arxiv.org/abs/2402.10744v1
- Date: Fri, 16 Feb 2024 15:01:24 GMT
- Title: GenRES: Rethinking Evaluation for Generative Relation Extraction in the
  Era of Large Language Models
- Authors: Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, Jiawei Han
- Abstract summary: We introduce GenRES for a multi-dimensional assessment of GRE results in terms of topic similarity, uniqueness, granularity, factualness, and completeness.
With GenRES, we empirically identified that precision/recall fails to justify the performance of GRE methods.
Next, we conducted a human evaluation of GRE methods that shows GenRES is consistent with human preferences for RE quality.
- Score: 48.56814147033251
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   The field of relation extraction (RE) is experiencing a notable shift towards
generative relation extraction (GRE), leveraging the capabilities of large
language models (LLMs). However, we discovered that traditional RE metrics
like precision and recall fall short in evaluating GRE methods. This shortfall
arises because these metrics rely on exact matching with human-annotated
reference relations, while GRE methods often produce diverse and semantically
accurate relations that differ from the references. To fill this gap, we
introduce GenRES for a multi-dimensional assessment of GRE results in terms of
topic similarity, uniqueness, granularity, factualness, and completeness.
With GenRES, we empirically identified that (1)
precision/recall fails to justify the performance of GRE methods; (2)
human-annotated referential relations can be incomplete; (3) prompting LLMs
with a fixed set of relations or entities can cause hallucinations. Next, we
conducted a human evaluation of GRE methods that shows GenRES is consistent
with human preferences for RE quality. Last, we conducted a comprehensive
evaluation of fourteen leading LLMs using GenRES across document-, bag-, and
sentence-level RE datasets to set the benchmark for future research in GRE.
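
To make the exact-matching shortfall concrete, the sketch below contrasts traditional precision/recall with a placeholder for a GenRES-style multi-dimensional assessment. This is a minimal illustration rather than the authors' implementation; Triple, exact_match_prf, and genres_style_scores are hypothetical names, and the dimension stubs only mark where topic similarity, uniqueness, granularity, factualness, and completeness would actually be computed.

# Minimal, self-contained sketch (not the paper's implementation) contrasting
# exact-match precision/recall with a placeholder multi-dimensional scorer in
# the spirit of GenRES. All names below are hypothetical illustrations.

from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


def exact_match_prf(predicted, reference):
    """Traditional RE scoring: a prediction counts only if it exactly
    matches a human-annotated reference triple."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def genres_style_scores(predicted, source_text):
    """Placeholder for a GenRES-style multi-dimensional assessment. Each
    dimension would be computed with its own model or statistic (e.g.,
    topic modelling, embedding similarity, fact verification against the
    source text); here the values are stubs."""
    return {
        "topic_similarity": 0.0,  # do the triples reflect the source topics?
        "uniqueness": 0.0,        # are the triples non-redundant?
        "granularity": 0.0,       # are the triples appropriately fine-grained?
        "factualness": 0.0,       # are the triples supported by the source?
        "completeness": 0.0,      # do the triples cover the source facts?
    }


# A generative extractor may phrase a correct relation differently from the
# reference, so exact matching scores it as a total miss.
reference = {Triple("Marie Curie", "awarded", "Nobel Prize in Physics")}
generated = {Triple("Marie Curie", "received", "the Nobel Prize in Physics")}
print(exact_match_prf(generated, reference))  # (0.0, 0.0, 0.0) despite a correct fact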
 
      
        Related papers
        - Benchmarking LLMs' Judgments with No Gold Standard [8.517244114791913]
We introduce GEM (Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by large language models (LLMs).
In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner.
We also present GRE-bench, which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers.
 arXiv  Detail & Related papers  (2024-11-11T16:58:36Z)
- Ground Every Sentence: Improving Retrieval-Augmented LLMs with   Interleaved Reference-Claim Generation [51.8188846284153]
Attributed Text Generation (ATG) is proposed to enhance credibility and verifiability in RAG systems.
This paper proposes ReClaim, a fine-grained ATG method that alternates the generation of references and answers step by step.
Extensive experiments verify the effectiveness of ReClaim across a range of settings, achieving a citation accuracy rate of 90%.
 arXiv  Detail & Related papers  (2024-07-01T20:47:47Z)
- Sequencing Matters: A Generate-Retrieve-Generate Model for Building
  Conversational Agents [9.191944519634111]
This paper describes the Georgetown InfoSense group's work on the challenges presented by TREC iKAT 2023.
Our submitted runs outperform the median runs by a significant margin, exhibiting superior performance in nDCG across various cut numbers and in overall success rate.
Our solution involves the use of Large Language Models (LLMs) for initial answers, answer grounding by BM25, passage quality filtering by logistic regression, and answer generation by LLMs again.
 arXiv  Detail & Related papers  (2023-11-16T02:37:58Z)
- Self-RAG: Learning to Retrieve, Generate, and Critique through
  Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG).
Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection.
It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
 arXiv  Detail & Related papers  (2023-10-17T18:18:32Z)
- Whether you can locate or not? Interactive Referring Expression
  Generation [12.148963878497243]
 We propose an Interactive REG (IREG) model that can interact with a real REC model.
IREG outperforms previous state-of-the-art methods on popular evaluation metrics.
 arXiv  Detail & Related papers  (2023-08-19T10:53:32Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited
  Reference Diversity in NLG Evaluation [55.92852268168816]
 N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
 arXiv  Detail & Related papers  (2023-08-06T14:49:26Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying   References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by increasing the number and diversity of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
 arXiv  Detail & Related papers  (2023-05-24T11:53:29Z)
- GPT-RE: In-context Learning for Relation Extraction using Large Language
  Models [43.968903620208444]
 GPT-RE bridges the gap between large language models and fully-supervised baselines in relation extraction.
We evaluate GPT-RE on four widely-used RE datasets, and observe that GPT-RE achieves improvements over existing GPT-3 baselines.
 arXiv  Detail & Related papers  (2023-05-03T13:28:08Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thought (CoT) reasoning and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin; a minimal sketch of this LLM-as-judge pattern appears after this list.
 arXiv  Detail & Related papers  (2023-03-29T12:46:54Z)
- A Hybrid Model of Classification and Generation for Spatial Relation
  Extraction [10.611528850772869]
 We first view spatial relation extraction as a generation task and propose a novel hybrid model HMCGR for this task.
 Experimental results on SpaceEval show that HMCGR outperforms the SOTA baselines significantly.
 arXiv  Detail & Related papers  (2022-08-15T01:31:44Z)
- Should We Rely on Entity Mentions for Relation Extraction? Debiasing
  Relation Extraction with Counterfactual Analysis [60.83756368501083]
 We propose the CORE (Counterfactual Analysis based Relation Extraction) debiasing method for sentence-level relation extraction.
Our CORE method is model-agnostic and debiases existing RE systems during inference without changing their training processes.
 arXiv  Detail & Related papers  (2022-05-08T05:13:54Z)
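
As referenced in the G-Eval entry above, the following is a minimal sketch of the LLM-as-judge pattern (chain-of-thought reasoning followed by a form-filling score). It is not the official G-Eval code; call_llm is a hypothetical stand-in for whatever completion API is available, and the score parsing is deliberately naive.

# Minimal sketch of an LLM-as-judge evaluator in the spirit of G-Eval
# (not the official implementation). call_llm is a hypothetical stand-in
# for a real completion API.

def call_llm(prompt):
    """Hypothetical LLM call; replace with a real completion API client."""
    raise NotImplementedError


def build_judge_prompt(source, output, dimension):
    """Chain-of-thought plus form-filling: ask the model to reason first,
    then emit a single numeric score for one quality dimension."""
    return (
        f"You will evaluate a system output for {dimension}.\n\n"
        f"Source text:\n{source}\n\n"
        f"System output:\n{output}\n\n"
        "First, reason step by step about the quality of the output.\n"
        "Then complete the form below.\n"
        f"{dimension} score (1-5):"
    )


def judge(source, output, dimensions=("coherence", "factual consistency")):
    scores = {}
    for dim in dimensions:
        reply = call_llm(build_judge_prompt(source, output, dim))
        # Naive form parsing: take the last digit in the reply. A real
        # evaluator would parse more robustly or average sampled replies.
        digits = [c for c in reply if c.isdigit()]
        scores[dim] = int(digits[-1]) if digits else None
    return scores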