Automated Evaluation of Personalized Text Generation using Large
Language Models
- URL: http://arxiv.org/abs/2310.11593v1
- Date: Tue, 17 Oct 2023 21:35:06 GMT
- Title: Automated Evaluation of Personalized Text Generation using Large
Language Models
- Authors: Yaqing Wang, Jiepu Jiang, Mingyang Zhang, Cheng Li, Yi Liang, Qiaozhu
Mei, Michael Bendersky
- Abstract summary: We present AuPEL, a novel evaluation method that distills three major semantic aspects of the generated text: personalization, quality and relevance, and automatically measures these aspects.
We find that, compared to existing evaluation metrics, AuPEL not only distinguishes and ranks models based on their personalization abilities more accurately, but also presents commendable consistency and efficiency for this task.
- Score: 38.2211640679274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalized text generation presents a specialized mechanism for delivering
content that is specific to a user's personal context. While the research
progress in this area has been rapid, evaluation still presents a challenge.
Traditional automated metrics such as BLEU and ROUGE primarily measure lexical
similarity to human-written references, and are not able to distinguish
personalization from other subtle semantic aspects, thus falling short of
capturing the nuances of personalized generated content quality. On the other
hand, human judgments are costly to obtain, especially in the realm of
personalized evaluation. Inspired by these challenges, we explore the use of
large language models (LLMs) for evaluating personalized text generation, and
examine their ability to understand nuanced user context. We present AuPEL, a
novel evaluation method that distills three major semantic aspects of the
generated text: personalization, quality and relevance, and automatically
measures these aspects. To validate the effectiveness of AuPEL, we design
carefully controlled experiments and compare the accuracy of the evaluation
judgments made by LLMs against that of judgments made by human annotators, and
conduct rigorous analyses of the consistency and sensitivity of the proposed
metric. We find that, compared to existing evaluation metrics, AuPEL not only
distinguishes and ranks models based on their personalization abilities more
accurately, but also presents commendable consistency and efficiency for this
task. Our work suggests that using LLMs as the evaluators of personalized text
generation is superior to traditional text similarity metrics, even though
interesting new challenges still remain.
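The paper does not include an implementation here, so the sketch below is only a rough illustration of the idea of LLM-based multi-aspect evaluation, not AuPEL's actual protocol: prompt an LLM once per aspect (personalization, quality, relevance) given the user's personal context, the writing task, and the candidate text, then parse a numeric rating. The aspect questions, the 1-5 scale, and the `call_llm` helper are assumptions introduced for illustration.

```python
# Minimal sketch of an LLM-based multi-aspect evaluator in the spirit of AuPEL.
# Assumptions: `call_llm` is a hypothetical stand-in for any chat/completion API;
# the prompts, rating scale, and aggregation used by AuPEL are defined in the
# paper and may differ from what is shown here.

ASPECTS = {
    "personalization": "How well does the candidate text reflect the style and "
                       "interests evident in the user's past writing?",
    "quality": "How fluent, coherent, and well-organized is the candidate text?",
    "relevance": "How well does the candidate text address the writing task?",
}

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's API."""
    raise NotImplementedError

def score_aspect(aspect: str, question: str, user_context: str,
                 task: str, candidate: str) -> int:
    """Ask the LLM to rate one aspect on a 1-5 scale and parse the reply."""
    prompt = (
        "You are evaluating personalized text generation.\n"
        f"User's past writing:\n{user_context}\n\n"
        f"Writing task:\n{task}\n\n"
        f"Candidate text:\n{candidate}\n\n"
        f"Question ({aspect}): {question}\n"
        "Answer with a single integer from 1 (worst) to 5 (best)."
    )
    reply = call_llm(prompt)
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # crude fallback for unparseable replies

def evaluate(user_context: str, task: str, candidate: str) -> dict:
    """Score all three aspects for one (user, task, candidate) triple."""
    return {aspect: score_aspect(aspect, question, user_context, task, candidate)
            for aspect, question in ASPECTS.items()}
```

Scoring each aspect with a separate prompt keeps the judgments disentangled, which is the property the paper relies on when ranking generators by personalization specifically rather than by overall text quality.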
Related papers
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z) - LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z) - Learning Personalized Alignment for Evaluating Open-ended Text Generation [44.565686959174585]
PerSE is an interpretable evaluation framework designed to assess alignment with specific human preferences.
It is tuned to infer specific preferences from an in-context personal profile and evaluate the alignment between the generated content and personal preferences.
Our 13B LLaMA-2-based PerSE shows a 15.8% increase in Kendall correlation and a 13.7% rise in accuracy compared to zero-shot reviewers.
arXiv Detail & Related papers (2023-10-05T04:15:48Z) - Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z) - Learning and Evaluating Human Preferences for Conversational Head
Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) that fits human preference based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that comprehensively compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z) - Automating Text Naturalness Evaluation of NLG Systems [0.0]
We present an attempt to automate the evaluation of text naturalness, replacing human participants who would otherwise score or label the text samples.
We analyze the text probability fractions and observe how they are influenced by the size of the generative and discriminative models involved in the process.
arXiv Detail & Related papers (2020-06-23T18:48:33Z)