Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation
- URL: http://arxiv.org/abs/2412.07429v1
- Date: Tue, 10 Dec 2024 11:40:11 GMT
- Title: Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation
- Authors: Javad Seraj, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi
- Abstract summary: We present a data augmentation technique to select more effective samples from limited data in order to align an open LLM with human preference.
Our work achieves approximately a 7% improvement in Pearson correlation with a reference judge over the baseline.
- Score: 2.933641361932625
- Abstract: Automatic evaluation by large language models (LLMs) is a prominent topic today; however, judgment and evaluation tasks are often subjective and influenced by various factors, making adaptation challenging. While many studies demonstrate the capabilities of state-of-the-art proprietary LLMs in comparison to human evaluators, they often struggle to adapt to reference evaluators over time, a requirement for achieving personalized judgment. Additionally, numerous works have attempted to apply open LLMs as judges or evaluators, but these efforts frequently overlook the limitations of working with scarce data. Personalized judgment is inherently associated with limited-data scenarios, which are common in many real-world problems. Our work presents a data augmentation technique to select more effective samples from limited data in order to align an open LLM with human preference. Our approach achieves approximately a 7% improvement in Pearson correlation with a reference judge over the baseline, and a 30% improvement over the base model (Llama3.1-8B-Instruct) on the mathematical reasoning evaluation task, demonstrating that selecting more effective preference data through augmentation enables our approach to surpass baseline methods.
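A minimal sketch (not the paper's implementation) of the evaluation setup described above: select the clearest preference pairs from a small pool of reference-judge annotations, then measure alignment between a tuned open judge and the reference judge via Pearson correlation. The selection heuristic and all names below are illustrative assumptions.

```python
# Sketch: preference-pair selection from limited data + Pearson correlation
# against a reference judge. Field names ("chosen_score", "rejected_score")
# and the margin-based heuristic are illustrative, not from the paper.
from statistics import mean
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_preference_pairs(samples, k):
    """Toy heuristic: keep the k pairs whose reference-judge scores for the
    chosen vs. rejected response differ most, i.e. the clearest preference
    signal in a limited-data setting."""
    ranked = sorted(samples,
                    key=lambda s: abs(s["chosen_score"] - s["rejected_score"]),
                    reverse=True)
    return ranked[:k]


# Usage: compare scores from the tuned open judge with the reference judge.
reference_scores = [4.0, 2.5, 5.0, 3.0, 1.5]
judge_scores = [3.5, 2.0, 4.5, 3.5, 2.0]
print(f"Pearson r = {pearson(reference_scores, judge_scores):.3f}")
```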
Related papers
- Aligning Black-box Language Models with Human Judgments [8.30794246257544]
Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks.
We propose a framework to align LLM judgments with individual human evaluators or their aggregated judgments.
Our approach achieves over 142% average improvement in agreement across 29 tasks with only a small number of calibration examples used for training.
arXiv Detail & Related papers (2025-02-07T15:19:40Z) - Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective [4.548047308860141]
This study investigates the impact of different types of preference data on model performance.
It aims to reduce their dependency on extensive amounts of preference data, which is expensive to collect.
arXiv Detail & Related papers (2024-10-22T00:11:41Z) - Direct Judgement Preference Optimization [66.83088028268318]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs.
We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective.
Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z) - Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation [5.653106385738822]
Polyrating is an expressive and flexible rating system based on maximum a posteriori estimation.
It can detect and quantify biases affecting human preferences, ensuring fairer model comparisons.
It can reduce the cost of human evaluations by up to 41% for new models and up to 77% for new tasks.
arXiv Detail & Related papers (2024-09-01T11:24:54Z) - CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models [58.57987316300529]
Large Language Models (LLMs) are increasingly deployed to handle various natural language processing (NLP) tasks.
To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets.
We propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks.
arXiv Detail & Related papers (2024-07-02T16:31:37Z) - Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language.
LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided, search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally (a minimal pairwise-ranking sketch appears after this list).
arXiv Detail & Related papers (2024-03-25T17:11:28Z) - Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement [79.2400720115588]
We introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts.
In the evaluation of response prediction, Persona-DB demonstrates superior context efficiency in maintaining accuracy with a significantly reduced retrieval size.
Our experiments also indicate a marked improvement of over 10% under cold-start scenarios, when users have extremely sparse data.
arXiv Detail & Related papers (2024-02-16T20:20:43Z) - F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation [9.452326973655445]
We find that metric-based methods enhance the efficiency of human evaluations by minimizing the number of required annotations.
We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54%.
This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.
arXiv Detail & Related papers (2023-10-22T21:48:51Z) - Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction [15.793007223588672]
Large Language Models (LLMs) have demonstrated exceptional capabilities in generalizing to new tasks in a zero-shot or few-shot manner.
We investigate LLMs of various sizes, ranging from 250M to 540B parameters, and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios.
arXiv Detail & Related papers (2023-05-10T21:43:42Z)
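The pairwise-preference idea referenced in the PAIRS entry above can be illustrated with a minimal sketch: rank candidate texts by aggregating local pairwise comparisons from a judge. The win-count aggregation and the toy comparator below are illustrative assumptions, not the paper's uncertainty-guided search.

```python
# Sketch: global ranking from local pairwise preferences. The comparator is a
# stand-in for an LLM judge call; win counting is a simplified aggregation.
from itertools import combinations
from typing import Callable, List


def rank_by_pairwise_wins(candidates: List[str],
                          prefer: Callable[[str, str], str]) -> List[str]:
    """Rank candidates by how many pairwise comparisons each one wins."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[prefer(a, b)] += 1  # the comparator returns the preferred text
    return sorted(candidates, key=lambda c: wins[c], reverse=True)


# Toy comparator: prefer the longer answer (replace with an LLM judge call).
toy_prefer = lambda a, b: a if len(a) >= len(b) else b
print(rank_by_pairwise_wins(
    ["short", "a medium answer", "the longest candidate answer"], toy_prefer))
```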