Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation
- URL: http://arxiv.org/abs/2602.07673v1
- Date: Sat, 07 Feb 2026 19:39:28 GMT
- Title: Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation
- Authors: Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy, Nesreen K. Ahmed, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi
- Abstract summary: Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization. We provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization.
- Score: 89.52571224447111
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language model (LLM) judges are often used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, reason more effectively, and are more robust to paraphrasing. However, LLM judges exhibit length and order biases, among others, and are vulnerable to various adversarial input prompts. While recent studies have examined these biases, few have analyzed them at a granular level in relation to a well-defined overlap metric. In this work, we provide an analysis of LLM judge bias as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarity (as measured by ROUGE and BLEU) between the judged summaries decreases; this pattern holds for all but one model tested and persists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlap, suggesting that LLM-as-a-judge in the summarization domain should rely on techniques beyond a simple comparison.
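To make the setup concrete, below is a minimal sketch (not the authors' code) of this kind of overlap-conditioned evaluation: score each human/LLM summary pair with a lexical-overlap metric, bucket the pairs by overlap, and measure how often a judge prefers the LLM side in each bucket. The unigram F1 is a simplified stand-in for ROUGE/BLEU, and `judge_prefers_llm` is a hypothetical wrapper around an actual LLM-judge call.

```python
# Minimal sketch: bucket LLM-judge decisions by lexical overlap between an
# LLM-written summary and its human-written counterpart. The unigram F1
# below is a simplified proxy for ROUGE-1; `judge_prefers_llm` stands in
# for a real LLM-judge call (hypothetical, not the paper's code).
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two texts (a rough ROUGE-1 analogue)."""
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def preference_rate_by_overlap(pairs, judge_prefers_llm, n_bins=5):
    """pairs: iterable of (human_summary, llm_summary) tuples.
    Returns, per overlap bin, the fraction of pairs in which the judge
    preferred the LLM summary over the human one."""
    wins, totals = [0] * n_bins, [0] * n_bins
    for human, llm in pairs:
        b = min(int(unigram_f1(human, llm) * n_bins), n_bins - 1)
        totals[b] += 1
        wins[b] += bool(judge_prefers_llm(human, llm))
    return [w / t if t else None for w, t in zip(wins, totals)]
```

If the paper's finding holds, the low-overlap bins would show the highest LLM win-rates.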
Related papers
- Quantitative LLM Judges [60.773734899532336]
We propose quantitative LLM judges, which align the evaluation scores of existing LLM judges to human judgments in a given domain. The models are trained to improve the score of the original judge using its rationale and score. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z)
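A hedged sketch of the post-hoc idea: fit a lightweight regressor that maps an existing judge's raw score, plus simple features of its rationale, onto human scores. The features below are illustrative assumptions, not the paper's actual design.

```python
# Illustrative post-hoc calibration of an LLM judge: regress human scores
# on the judge's own score plus a crude rationale feature. Feature choice
# is an assumption for illustration, not the paper's method.
import numpy as np
from sklearn.linear_model import Ridge

def fit_quantitative_judge(judge_scores, rationales, human_scores):
    X = np.array([[s, len(r.split())] for s, r in zip(judge_scores, rationales)])
    y = np.array(human_scores)
    return Ridge(alpha=1.0).fit(X, y)  # .predict() re-scores new judge outputs
```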
- A Simple Ensemble Strategy for LLM Inference: Towards More Stable Text Classification [0.0]
This study introduces a straightforward ensemble strategy for sentiment analysis using large language models (LLMs). We demonstrate that ensembling multiple inferences from medium-sized LLMs produces more robust and accurate results than a single attempt with a large model, reducing RMSE by 18.6%.
arXiv Detail & Related papers (2025-04-26T10:10:26Z)
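A rough illustration of the ensemble idea, assuming `query_llm` is a hypothetical inference call that returns a numeric rating as text; the prompt wording is also assumed:

```python
# Sketch of the ensemble strategy: sample several predictions from one or
# more medium-sized LLMs and aggregate them. `query_llm` is hypothetical.
from statistics import median

def ensemble_sentiment(text, models, query_llm, n_samples=3):
    prompt = f"Rate the sentiment of this text from 1 to 5:\n{text}"
    scores = [float(query_llm(m, prompt)) for m in models for _ in range(n_samples)]
    return median(scores)  # median is robust to occasional outlier replies
```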
- Evaluating how LLM annotations represent diverse views on contentious topics [3.405231040967506]
We show that generative large language models (LLMs) tend to be biased in the same directions on the same demographic categories within the same datasets. We conclude with a discussion of the implications for researchers and practitioners using LLMs for automated data annotation tasks.
arXiv Detail & Related papers (2025-03-29T22:53:15Z)
- REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities [45.00513157371274]
We evaluate the framework of using large language models as judges in Russian. We rank six generative LLMs across the error types using three rating systems based on human preferences. Our findings reveal a notable gap between LLM judge performance in Russian and English.
arXiv Detail & Related papers (2025-03-17T12:15:16Z)
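For context on rating systems built from pairwise human preferences, here is a generic Elo-style update; the paper compares three rating systems, and this example is illustrative rather than necessarily one of theirs.

```python
# Generic Elo update from a single pairwise preference (winner vs. loser).
# One example of a preference-based rating system; not necessarily the
# paper's exact choice.
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    gain = k * (1.0 - expected)
    return r_winner + gain, r_loser - gain
```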
- Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
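A toy probe for this kind of contamination (my construction, not the paper's protocol): compare a judge's win-rate for outputs from a generator related to it against its win-rate for unrelated generators; a positive gap suggests leakage.

```python
# Toy preference-leakage probe: does the judge favor outputs from a
# generator related to it (same family / trained on its synthetic data)
# more often than outputs from unrelated generators? Hypothetical setup.
def leakage_gap(related_pairs, unrelated_pairs, judge_prefers_first):
    def win_rate(pairs):
        return sum(bool(judge_prefers_first(a, b)) for a, b in pairs) / len(pairs)
    return win_rate(related_pairs) - win_rate(unrelated_pairs)
```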
- Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities [0.0]
We compare data generated with mono- and multilingual LLMs spanning a range of model sizes with data provided by human participants. We aim to develop a benchmark to assess the capabilities of LLMs with discourse biases as a robust proxy for more general discourse understanding capabilities.
arXiv Detail & Related papers (2025-01-22T16:07:24Z)
- WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs).
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage and RAG with two contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
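A minimal sketch of the two-passage scenario; the prompt wording is assumed rather than taken from the benchmark:

```python
# Sketch of the two-contradictory-passage RAG condition: the model must
# answer from two passages that disagree. Wording is illustrative only.
def contradictory_rag_prompt(question: str, passage_a: str, passage_b: str) -> str:
    return (
        "Answer the question using only the passages below. "
        "If they contradict each other, say so and present both answers.\n\n"
        f"Passage 1: {passage_a}\n\n"
        f"Passage 2: {passage_b}\n\n"
        f"Question: {question}\nAnswer:"
    )
```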
- Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks [3.58262772907022]
We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other's responses to produce a ranking in a democratic fashion. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge.
arXiv Detail & Related papers (2024-06-12T19:05:43Z)
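A simplified sketch of council-style aggregation, assuming each judge model can rank all responses (`rank_fn` is a hypothetical wrapper); the LMC's actual voting procedure may differ:

```python
# Council-style ranking sketch: every judge ranks every response, and mean
# rank (a Borda-like rule) decides the final order. Simplified assumption;
# the LMC's real aggregation may differ.
def council_ranking(responses, judges, rank_fn):
    """rank_fn(judge, responses) -> list of ranks (0 = best), one per response."""
    totals = [0.0] * len(responses)
    for judge in judges:
        for i, rank in enumerate(rank_fn(judge, responses)):
            totals[i] += rank
    return sorted(range(len(responses)), key=lambda i: totals[i])  # best first
```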
- On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
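The setting can be sketched as a data-substitution step, with `llm_summarize` standing in for a call to a strong LLM; the smaller summarization model is then fine-tuned on these pairs instead of the original references:

```python
# Sketch of LLM-as-reference training data: replace a dataset's human
# references with LLM-written summaries. `llm_summarize` is hypothetical.
def build_llm_reference_dataset(articles, llm_summarize):
    return [{"document": a, "summary": llm_summarize(a)} for a in articles]
```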
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.