Related papers: Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking

URL: http://arxiv.org/abs/2409.15268v2
Date: Mon, 30 Sep 2024 18:59:40 GMT
Title: Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
Authors: Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson,
Abstract summary: Post-training methods claim superior alignment by virtue of better correspondence with human pairwise preferences. We attempt to answer the question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We find that (1) LLM-judge preferences do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM-judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning stage of post-training, and not the PO stage, has the greatest impact on alignment.
Score: 56.275521022148794
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM-judges. In this work, we attempt to answer the following question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench (Substance Outweighs Style Benchmark), which is to the best of our knowledge the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judge preferences do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM-judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.

Related papers

Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments [6.270885758858811]
Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging. We propose a three-stage meta-judge selection pipeline: 1) developing a comprehensive rubric with GPT-4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low-scoring judgments. Experimental results on the JudgeBench dataset show about 15.55% improvement compared to raw judgments and about 8.37% improvement over the single-agent baseline.
arXiv Detail & Related papers (2025-04-23T20:32:12Z)
Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only [37.36302216137465]
We use methods based on self-consistency, token probabilities, and LLM-as-a-judge to elicit wrong-over-wrong preferences. Experiments show that LLMs do have preliminary capability in distinguishing various shades of wrong, achieving up to 20.9% higher performance than random guess.
arXiv Detail & Related papers (2024-10-14T20:01:52Z)
Are LLM-based Recommenders Already the Best? Simple Scaled Cross-entropy Unleashes the Potential of Traditional Sequential Recommenders [31.116716790604116]
Large language models (LLMs) have been garnering increasing attention in the recommendation community. Some studies have observed that LLMs, when fine-tuned by the cross-entropy (CE) loss with a full softmax, could achieve state-of-the-art' performance in sequential recommendation. This study provides theoretical justification for the superiority of the cross-entropy loss.
arXiv Detail & Related papers (2024-08-26T12:52:02Z)
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs [6.627477206883248]
Large language models (LLMs) are trained on a deluge of text data with limited quality control. Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour. Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective.
arXiv Detail & Related papers (2024-08-02T17:55:50Z)
Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more. We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning [61.68787689234622]
A recent study, LIMA, shows that using merely 1K examples for alignment tuning can achieve significant alignment performance as well. This raises questions about how exactly the alignment tuning transforms a base LLM. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting.
arXiv Detail & Related papers (2023-12-04T00:46:11Z)
Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z)
$k$NN Prompting: Beyond-Context Learning with Calibration-Free Nearest Neighbor Inference [75.08572535009276]
In-Context Learning (ICL) formulates target tasks as prompt completion conditioned on in-context demonstrations. $k$NN Prompting first queries LLM with training data for distributed representations, then predicts test instances by simply referring to nearest neighbors. It significantly outperforms state-of-the-art calibration-based methods under comparable few-shot scenario.
arXiv Detail & Related papers (2023-03-24T06:16:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.