Related papers: Preference Leakage: A Contamination Problem in LLM-as-a-judge

URL: http://arxiv.org/abs/2502.01534v1
Date: Mon, 03 Feb 2025 17:13:03 GMT
Title: Preference Leakage: A Contamination Problem in LLM-as-a-judge
Authors: Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu
Abstract summary: Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
Score: 69.96778498636071
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between data generator LLM and judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: https://github.com/David-Li0406/Preference-Leakage.

Related papers

Benchmark Leakage Trap: Can We Trust LLM-based Recommendation? [9.574427977779235]
This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. Data leakage acts as a critical, previously unaccounted-for factor in LLM-based recommendation, which could impact the true model performance.
arXiv Detail & Related papers (2026-02-14T06:34:19Z)
Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation [89.52571224447111]
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization. We provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization.
arXiv Detail & Related papers (2026-02-07T19:39:28Z)
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
Correlated Errors in Large Language Models [0.6856888934092934]
We find substantial correlation in model errors on a leaderboard dataset. We identify factors driving model correlation, including shared architectures and providers. We show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring.
arXiv Detail & Related papers (2025-06-09T17:37:18Z)
Self-ensemble: Mitigating Confidence Distortion for Large Language Models [89.03110940871765]
Large Language Models exhibit a confidence distortion problem on multi-choice question-answering. We propose Self-ensemble to solve this problem. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem.
arXiv Detail & Related papers (2025-06-02T17:59:29Z)
Applying Large Language Models to Travel Satisfaction Analysis [2.5105418815378555]
This study uses household survey data collected in Shanghai to identify the existence and source of misalignment between Large Language Models (LLMs) and humans. LLMs have strong capabilities in contextual understanding and generalization, significantly reducing dependence on task-specific data. We propose an LLM-based modeling approach that can be applied to model travel behavior with small sample sizes.
arXiv Detail & Related papers (2025-05-29T09:11:58Z)
DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs [1.89915151018241]
We argue that implicit bias in Large Language Models (LLMs) is not only an ethical, but also a technical issue. We developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness).
arXiv Detail & Related papers (2025-05-15T06:53:37Z)
LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
Understanding and Mitigating the Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks [24.706895491806794]
This work presents the first systematic investigation in understanding, analyzing, and mitigating bias inheritance. We analyze how 6 different types of biases manifest at varying bias ratios. We propose three mitigation strategies: token-based, mask-based, and loss-based approaches.
arXiv Detail & Related papers (2025-02-06T15:20:58Z)
Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification? [2.1861408994125253]
Large Language Models (LLM) have recently shown promising performance in temporal reasoning tasks. Recent studies have tested the LLMs' performance in detecting temporal relations of closed-source models only.
arXiv Detail & Related papers (2024-10-14T13:10:45Z)
Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose a quite efficient and universal data selection method. We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models [56.02275285521847]
We propose to evaluate models using a Panel of LLm evaluators (PoLL). We find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
arXiv Detail & Related papers (2024-04-29T15:33:23Z)
Causal Prompting: Debiasing Large Language Model Prompting based on Front-Door Adjustment [32.12998469814097]
A novel causal prompting method based on front-door adjustment is proposed to effectively mitigate Large Language Models (LLMs) biases. Experimental results show that the proposed causal prompting approach achieves excellent performance across seven natural language processing datasets.
arXiv Detail & Related papers (2024-03-05T07:47:34Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs). We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We show that these models achieve almost close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility [37.682136465784254]
We conduct over a million queries to the mainstream large language models (LLMs) including ChatGPT, LLaMA, and OPT. We find that ChatGPT is still capable to yield the correct answer even when the input is polluted at an extreme level. We propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation.
arXiv Detail & Related papers (2023-05-15T15:44:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.