Learning Personalized Alignment for Evaluating Open-ended Text Generation
- URL: http://arxiv.org/abs/2310.03304v5
- Date: Fri, 04 Oct 2024 03:58:14 GMT
- Title: Learning Personalized Alignment for Evaluating Open-ended Text Generation
- Authors: Danqing Wang, Kevin Yang, Hanlin Zhu, Xiaomeng Yang, Andrew Cohen, Lei Li, Yuandong Tian
- Abstract summary: PerSE is an interpretable evaluation framework designed to assess alignment with specific human preferences.
It is tuned to infer specific preferences from an in-context personal profile and evaluate the alignment between the generated content and personal preferences.
Our 13B LLaMA-2-based PerSE shows a 15.8% increase in Kendall correlation and a 13.7% rise in accuracy with zero-shot reviewers, compared to GPT-4.
- Score: 44.565686959174585
- Abstract: Recent research has increasingly focused on evaluating large language models' (LLMs) alignment with diverse human values and preferences, particularly for open-ended tasks like story generation. Traditional evaluation metrics rely heavily on lexical similarity with human-written references, often showing poor correlation with human judgments and failing to account for alignment with the diversity of human preferences. To address these challenges, we introduce PerSE, an interpretable evaluation framework designed to assess alignment with specific human preferences. It is tuned to infer specific preferences from an in-context personal profile and evaluate the alignment between the generated content and personal preferences. PerSE enhances interpretability by providing detailed comments and fine-grained scoring, facilitating more personalized content generation. Our 13B LLaMA-2-based PerSE shows a 15.8% increase in Kendall correlation and a 13.7% rise in accuracy with zero-shot reviewers compared to GPT-4. It also outperforms GPT-4 by 46.01% in Kendall correlation on new domains, indicating its transferability.
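To make the setup concrete, the following is a minimal sketch, assuming a generic LLM query helper, of how a PerSE-style personalized review could be assembled and scored: the reviewer's in-context profile and the generated story go into one prompt, the model returns a fine-grained score plus a comment, and agreement with that reviewer's own ratings is measured with Kendall correlation. The prompt wording, JSON reply format, and `query_llm` callable are illustrative assumptions rather than the paper's released implementation; only `scipy.stats.kendalltau` is a standard library call.

```python
# Hypothetical sketch of a PerSE-style personalized evaluation step.
# The prompt template, JSON reply format, and `query_llm` callable are
# illustrative assumptions, not the paper's released implementation.
import json
from scipy.stats import kendalltau  # standard Kendall rank correlation

def build_prompt(profile: str, story: str) -> str:
    # Put the reviewer's in-context profile before the story so the
    # evaluator can infer that specific reviewer's preferences.
    return (
        "You are reviewing a story for one specific reader.\n"
        f"Reader profile:\n{profile}\n\n"
        f"Story:\n{story}\n\n"
        "Return JSON with an integer 'score' (1-5) for how well the story "
        "aligns with this reader's preferences and a short 'comment'."
    )

def personalized_review(query_llm, profile: str, story: str) -> dict:
    # `query_llm` is any callable mapping a prompt string to model text.
    reply = query_llm(build_prompt(profile, story))
    return json.loads(reply)  # e.g. {"score": 4, "comment": "..."}

def rank_agreement(model_scores, human_scores) -> float:
    # Kendall's tau between per-story scores: the correlation statistic
    # reported in the abstract (higher means closer to the human ranking).
    tau, _p_value = kendalltau(model_scores, human_scores)
    return tau
```

For instance, `rank_agreement([4, 2, 5, 3], [5, 2, 4, 3])` is about 0.67, meaning the evaluator orders the four stories largely as the reviewer does.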
Related papers
- CRoP: Context-wise Robust Static Human-Sensing Personalization [7.303478793193849]
We introduce CRoP, a novel static personalization approach using an off-the-shelf pre-trained model and pruning to optimize personalization and generalization.
CRoP shows superior personalization effectiveness and intra-user robustness across four human-sensing datasets, including two from real-world health domains.
arXiv Detail & Related papers (2024-09-26T16:06:38Z)
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Model (LLM) evaluator preferences with human evaluations.
We employed Bayesian statistics and a t-test to quantify this token-count bias and developed a recalibration procedure to adjust the GPTScorer.
The recalibration significantly improves the agreement between the LLM evaluator and human evaluations across multiple use cases (a minimal sketch of this kind of bias check appears after the related-papers list).
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [92.66784679667441]
Prometheus 2 is a more powerful evaluator LM that closely mirrors human and GPT-4 judgements.
It can process both direct assessment and pairwise ranking formats together with user-defined evaluation criteria.
On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges.
arXiv Detail & Related papers (2024-05-02T17:59:35Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Automated Evaluation of Personalized Text Generation using Large
Language Models [38.2211640679274]
We present AuPEL, a novel evaluation method that distills three major semantic aspects of the generated text (personalization, quality, and relevance) and automatically measures them.
We find that, compared to existing evaluation metrics, AuPEL not only distinguishes and ranks models based on their personalization abilities more accurately, but also presents commendable consistency and efficiency for this task.
arXiv Detail & Related papers (2023-10-17T21:35:06Z) - Prometheus: Inducing Fine-grained Evaluation Capability in Language
Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine granularity of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption
Evaluation via Typicality Analysis [20.026835809227283]
We introduce "typicality", a new formulation of evaluation rooted in information theory.
We show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences.
Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
arXiv Detail & Related papers (2021-06-02T19:58:20Z)
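The token-count-bias check described in the "Aligning Model Evaluations with Human Preferences" entry above can be illustrated with a short, hypothetical sketch: collect the evaluator's scores for short and long responses of comparable quality and test whether the mean scores differ with an independent two-sample t-test. The score lists below are placeholder values, and the sketch covers only the bias test, not that paper's Bayesian analysis or recalibration procedure.

```python
# Hypothetical illustration of probing an LLM evaluator for token-count
# bias with a two-sample t-test; the score lists are placeholder values.
from scipy.stats import ttest_ind

# Evaluator scores for responses judged comparable in quality by humans,
# split by response length.
scores_short = [3.8, 4.0, 3.6, 3.9, 4.1]
scores_long = [4.4, 4.6, 4.3, 4.5, 4.7]

t_stat, p_value = ttest_ind(scores_long, scores_short)
if p_value < 0.05:
    # A significant gap suggests the evaluator rewards length itself,
    # which a recalibration step would then need to correct.
    print(f"Length bias detected: t = {t_stat:.2f}, p = {p_value:.4f}")
else:
    print("No significant length bias at the 5% level.")
```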