Intrinsic Task-based Evaluation for Referring Expression Generation
- URL: http://arxiv.org/abs/2402.07432v1
- Date: Mon, 12 Feb 2024 06:21:35 GMT
- Title: Intrinsic Task-based Evaluation for Referring Expression Generation
- Authors: Guanyi Chen, Fahime Same, Kees van Deemter
- Abstract summary: Referring Expressions (REs) generated by state-of-the-art neural models were indistinguishable not only from the REs in WebNLG but also from the REs generated by a simple rule-based system.
Here, we argue that this limitation could stem from the use of a purely ratings-based human evaluation.
We propose an intrinsic task-based evaluation for REG models, in which, in addition to rating the quality of REs, participants were asked to accomplish two meta-level tasks.
- Score: 9.322715583523928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, a human evaluation study of Referring Expression Generation (REG)
models had an unexpected conclusion: on WebNLG, Referring Expressions (REs)
generated by state-of-the-art neural models were indistinguishable not only
from the REs in WebNLG but also from the REs generated by a simple rule-based
system. Here, we argue that this limitation
could stem from the use of a purely ratings-based human evaluation (which is a
common practice in Natural Language Generation). To investigate these issues,
we propose an intrinsic task-based evaluation for REG models, in which, in
addition to rating the quality of REs, participants were asked to accomplish
two meta-level tasks. One of these tasks concerns the referential success of
each RE; the other task asks participants to suggest a better alternative for
each RE. The outcomes suggest that, in comparison to previous evaluations, the
new evaluation protocol assesses the performance of each REG model more
comprehensively and makes the participants' ratings more reliable and
discriminable.
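To make the protocol concrete, the following is a minimal sketch (not taken from the paper; the field names, rating scale, and aggregation are illustrative assumptions) of how responses from the rating task and the two meta-level tasks might be collected and summarised per REG model:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Response:
    """One participant's judgement of a single RE (hypothetical schema)."""
    model: str                 # REG system that produced the RE (e.g. neural, rule-based, corpus)
    rating: int                # quality rating, e.g. on a 1-5 scale
    referent_identified: bool  # referential success: did the participant pick the intended referent?
    alternative: str | None    # a better RE suggested by the participant, if any

def aggregate(responses: list[Response]) -> dict[str, dict[str, float]]:
    """Summarise the three evaluation signals per REG model."""
    by_model: dict[str, list[Response]] = {}
    for r in responses:
        by_model.setdefault(r.model, []).append(r)
    return {
        model: {
            "mean_rating": mean(r.rating for r in rs),
            "referential_success": mean(r.referent_identified for r in rs),
            "rewrite_rate": mean(r.alternative is not None for r in rs),
        }
        for model, rs in by_model.items()
    }
```

Under this reading, referential success and the rate at which participants propose rewrites complement the ratings, which is what would make the protocol more discriminative than ratings alone.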
Related papers
- Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding [3.8673630752805446]
We propose an approach to referring expression generation (REG) that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate.
Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs.
arXiv Detail & Related papers (2024-09-09T15:33:07Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - Sequencing Matters: A Generate-Retrieve-Generate Model for Building
Conversational Agents [9.191944519634111]
This paper describes the work the Georgetown InfoSense group has done to address the challenges presented by TREC iKAT 2023.
Our submitted runs outperform the median runs by a significant margin, exhibiting superior performance in nDCG at various cut-off depths and in overall success rate.
Our solution involves the use of Large Language Models (LLMs) for initial answers, answer grounding by BM25, passage quality filtering by logistic regression, and answer generation by LLMs again (a minimal sketch of this pipeline is given after this list).
arXiv Detail & Related papers (2023-11-16T02:37:58Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - Whether you can locate or not? Interactive Referring Expression
Generation [12.148963878497243]
We propose an Interactive REG (IREG) model that can interact with a real REC model.
IREG outperforms previous state-of-the-art methods on popular evaluation metrics.
arXiv Detail & Related papers (2023-08-19T10:53:32Z) - A Comprehensive Survey on Relation Extraction: Recent Advances and New Frontiers [76.51245425667845]
Relation extraction (RE) involves identifying the relations between entities from underlying content.
Deep neural networks have dominated the field of RE and made noticeable progress.
This survey is expected to facilitate researchers' collaborative efforts to address the challenges of real-world RE systems.
arXiv Detail & Related papers (2023-06-03T08:39:25Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgements on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - RISE: Leveraging Retrieval Techniques for Summarization Evaluation [3.9215337270154995]
We present RISE, a new approach for evaluating summaries by leveraging techniques from information retrieval.
RISE is first trained as a retrieval task using a dual-encoder retrieval setup, and can then be used for evaluating a generated summary given an input document, without gold reference summaries.
We conduct comprehensive experiments on the SummEval benchmark (Fabbri et al., 2021) and the results show that RISE has higher correlation with human evaluations compared to many past approaches to summarization evaluation.
arXiv Detail & Related papers (2022-12-17T01:09:22Z) - An Overview of Distant Supervision for Relation Extraction with a Focus
on Denoising and Pre-training Methods [0.0]
Relation Extraction is a foundational task of natural language processing.
The history of RE methods can be roughly organized into four phases: pattern-based RE, statistical-based RE, neural-based RE, and large language model-based RE.
arXiv Detail & Related papers (2022-07-17T21:02:04Z) - Unsupervised Reference-Free Summary Quality Evaluation via Contrastive
Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
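As noted in the Sequencing Matters summary above, a generate-retrieve-generate pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables standing in for the LLM, the BM25 retriever, and the logistic-regression passage filter, as well as the cut-off and threshold values, are assumptions.

```python
from typing import Callable

# Hypothetical stand-ins: in a real system these would be an LLM call, a BM25
# index over the passage collection, and a trained logistic-regression filter.
LLM = Callable[[str], str]                    # prompt -> completion
Retriever = Callable[[str, int], list[str]]   # query, k -> top-k passages
QualityFilter = Callable[[str], float]        # passage -> quality probability

def generate_retrieve_generate(
    question: str,
    llm: LLM,
    bm25_retrieve: Retriever,
    passage_quality: QualityFilter,
    k: int = 20,
    threshold: float = 0.5,
) -> str:
    # 1. Generate an initial answer with the LLM (it may be unsupported).
    initial_answer = llm(f"Answer the question: {question}")
    # 2. Ground the answer: use it as a query to retrieve supporting passages.
    candidates = bm25_retrieve(initial_answer, k)
    # 3. Keep only passages the quality classifier scores above the threshold.
    passages = [p for p in candidates if passage_quality(p) >= threshold]
    # 4. Generate the final answer conditioned on the retained passages.
    context = "\n".join(passages)
    return llm(f"Using only the following passages:\n{context}\n\nAnswer: {question}")
```

The point of the ordering is that retrieval is conditioned on a draft answer rather than on the raw question, which appears to be the "sequencing" the title refers to.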
This list is automatically generated from the titles and abstracts of the papers on this site.