Related papers: LLM Evaluators Recognize and Favor Their Own Generations

LLM Evaluators Recognize and Favor Their Own Generations

URL: http://arxiv.org/abs/2404.13076v1
Date: Mon, 15 Apr 2024 16:49:59 GMT
Title: LLM Evaluators Recognize and Favor Their Own Generations
Authors: Arjun Panickssery, Samuel R. Bowman, Shi Feng,
Abstract summary: We investigate if self-recognition capability contributes to self-preference. We find a linear correlation between self-recognition capability and the strength of self-preference bias. We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.
Score: 33.672365386365236
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-evaluation using large language models (LLMs) has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators consider them of equal quality. But do LLMs actually recognize their own outputs when they give those texts higher scores, or is it just a coincidence? In this paper, we investigate if self-recognition capability contributes to self-preference. We discover that, out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. By fine-tuning LLMs, we discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, we show that the causal explanation resists straightforward confounders. We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.

Related papers

Do LLM Evaluators Prefer Themselves for a Reason? [21.730128682888168]
Large language models (LLMs) are increasingly used as automatic evaluators in applications such as benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses. This raises a critical question: Is self-preference detrimental, or does it simply reflect objectively superior outputs from more capable models?
arXiv Detail & Related papers (2025-04-04T18:09:23Z)
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference [63.03859517284341]
An automatic evaluation framework aims to rank LLMs based on their alignment with human preferences. An automatic LLM bencher consists of four components: the input set, the evaluation model, the evaluation type and the aggregation method.
arXiv Detail & Related papers (2024-12-31T17:46:51Z)
Self-Preference Bias in LLM-as-a-Judge [13.880151307013321]
We introduce a novel metric to measure the self-preference bias in large language models (LLMs) Our results show GPT-4 exhibits a significant degree of self-preference bias. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.
arXiv Detail & Related papers (2024-10-29T07:42:18Z)
Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring [21.7782670140939]
Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. This paper uncovers the grading rubrics that LLMs used to score students' written responses to science tasks and their alignment with human scores.
arXiv Detail & Related papers (2024-07-04T22:26:20Z)
Self-Cognition in Large Language Models: An Exploratory Study [77.47074736857726]
This paper performs a pioneering study to explore self-cognition in Large Language Models (LLMs) We first construct a pool of self-cognition instruction prompts to evaluate where an LLM exhibits self-cognition. We observe a positive correlation between model size, training data quality, and self-cognition level.
arXiv Detail & Related papers (2024-07-01T17:52:05Z)
Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions [77.66677127535222]
Auto-Arena is an innovative framework that automates the entire evaluation process using LLM-powered agents. In our experiments, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrade on others. We formally define LLM's self-bias - the tendency to favor its own generation. We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation [71.91287418249688]
Large language models (LLMs) often struggle with factual inaccuracies, even when they hold relevant knowledge. We leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks.
arXiv Detail & Related papers (2024-02-14T15:52:42Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Benchmarking Cognitive Biases in Large Language Models as Evaluators [16.845939677403287]
Large Language Models (LLMs) have been shown to be effective as automatic evaluators with simple prompting and in-context learning. We evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark.
arXiv Detail & Related papers (2023-09-29T06:53:10Z)
Self-Refine: Iterative Refinement with Self-Feedback [62.78755306241981]
Self-Refine is an approach for improving initial outputs from large language models (LLMs) through iterative feedback and refinement. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
arXiv Detail & Related papers (2023-03-30T18:30:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.