ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation
- URL: http://arxiv.org/abs/2501.14956v1
- Date: Fri, 24 Jan 2025 22:44:22 GMT
- Title: ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation
- Authors: Alireza Salemi, Julian Killingback, Hamed Zamani
- Abstract summary: ExPerT is an explainable reference-based evaluation framework for personalized text generation.
Our experiments demonstrate that ExPerT achieves a 7.2% relative improvement in alignment with human judgments.
Human evaluators rated the usability of ExPerT's explanations at 4.7 out of 5, highlighting its effectiveness in making evaluation decisions more interpretable.
- Abstract: Evaluating personalized text generated by large language models (LLMs) is challenging, as only the LLM user, i.e., prompt author, can reliably assess the output, but re-engaging the same individuals across studies is infeasible. This paper addresses the challenge of evaluating personalized text generation by introducing ExPerT, an explainable reference-based evaluation framework. ExPerT leverages an LLM to extract atomic aspects and their evidence from the generated and reference texts, match the aspects, and evaluate their alignment based on content and writing style -- two key attributes in personalized text generation. Additionally, ExPerT generates detailed, fine-grained explanations for every step of the evaluation process, enhancing transparency and interpretability. Our experiments demonstrate that ExPerT achieves a 7.2% relative improvement in alignment with human judgments compared to the state-of-the-art text generation evaluation methods. Furthermore, human evaluators rated the usability of ExPerT's explanations at 4.7 out of 5, highlighting its effectiveness in making evaluation decisions more interpretable.
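The extract/match/score pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: in ExPerT an LLM performs both the aspect extraction and the content/style alignment judgment, whereas here `extract_aspects` is a trivial sentence split and `overlap` is a token-overlap proxy; the F1 aggregation is one plausible way to combine matched aspects, assumed for the sketch.

```python
def extract_aspects(text):
    # Stand-in for LLM aspect extraction: treat each sentence as one atomic aspect.
    return [s.strip() for s in text.split(".") if s.strip()]

def overlap(a, b):
    # Token-overlap proxy for the LLM's content/style alignment judgment.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def expert_score(generated, reference, threshold=0.2):
    gen, ref = extract_aspects(generated), extract_aspects(reference)
    matches, used = [], set()
    for g in gen:  # greedy one-to-one matching of generated aspects to reference aspects
        best, best_sim = None, threshold
        for i, r in enumerate(ref):
            if i in used:
                continue
            sim = overlap(g, r)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None:
            used.add(best)
            matches.append((g, ref[best], best_sim))
    if not gen or not ref:
        return 0.0, matches
    # Precision over generated aspects, recall over reference aspects, combined as F1.
    p, r = len(matches) / len(gen), len(matches) / len(ref)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return f1, matches
```

The matched pairs returned alongside the score correspond to the kind of per-aspect evidence that makes the evaluation explainable: each match records which generated aspect was aligned with which reference aspect and how strongly.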
Related papers
- Beyond Turing Test: Can GPT-4 Sway Experts' Decisions? [14.964922012236498]
This paper explores how generated text impacts readers' decisions, focusing on both amateur and expert audiences.
Our findings indicate that GPT-4 can generate persuasive analyses affecting the decisions of both amateurs and professionals.
The results highlight a high correlation between real-world evaluation through audience reactions and the current multi-dimensional evaluators commonly used for generative models.
arXiv Detail & Related papers (2024-09-25T07:55:36Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects [32.50977115108103]
We introduce X-Eval, a two-stage instruction tuning framework to evaluate the text in both seen and unseen aspects customized by end users.
X-Eval consists of two learning stages: the vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality.
arXiv Detail & Related papers (2023-11-15T09:01:55Z) - Automated Evaluation of Personalized Text Generation using Large Language Models [38.2211640679274]
We present AuPEL, a novel evaluation method that distills three major semantic aspects of the generated text: personalization, quality and relevance, and automatically measures these aspects.
We find that, compared to existing evaluation metrics, AuPEL not only distinguishes and ranks models based on their personalization abilities more accurately, but also presents commendable consistency and efficiency for this task.
arXiv Detail & Related papers (2023-10-17T21:35:06Z) - DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
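The decompose-and-recompose scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: in DecompEval a pretrained language model answers each subquestion, whereas `answer_subquestion` here is a hypothetical keyword heuristic, and averaging the "yes" answers is one simple way to recompose them into a score.

```python
def decompose(text, aspect):
    # Split the instruction-style question about the whole text into one
    # subquestion per sentence.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [f'Is this sentence {aspect}: "{s}"?' for s in sentences]

def answer_subquestion(question):
    # Hypothetical stand-in for the PLM answering step: the real metric
    # generates "yes"/"no" with evidence; here a trivial marker check.
    return "no" if "???" in question else "yes"

def decomp_eval(text, aspect="fluent"):
    subqs = decompose(text, aspect)
    answers = [answer_subquestion(q) for q in subqs]
    # Recompose the subquestion answers into a single score: fraction of "yes".
    return sum(a == "yes" for a in answers) / max(len(answers), 1)
```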
arXiv Detail & Related papers (2023-07-13T16:16:51Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods such as BLEU/ROUGE may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that provides a comprehensive assessment by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
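A universal text transformation of the kind listed above can be sketched as follows. This is a small illustrative example, not TextFlint's API: `swap_adjacent_chars` simulates a typo perturbation (one of the character-level transformation families such toolkits cover), and `augment` generates perturbed copies of each input for robustness testing.

```python
import random

def swap_adjacent_chars(text, rng):
    # Universal transformation: swap two adjacent characters in one word,
    # simulating a typo. Length and word count are preserved.
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def augment(samples, n_variants=2, seed=0):
    # Pair each original input with perturbed variants for robustness evaluation.
    rng = random.Random(seed)
    return [(s, [swap_adjacent_chars(s, rng) for _ in range(n_variants)])
            for s in samples]
```

Running a model on both the originals and the variants, then comparing the two accuracy figures, gives the kind of targeted robustness analysis the toolkit's reports are built from.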
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.