Reasons to Reject? Aligning Language Models with Judgments
- URL: http://arxiv.org/abs/2312.14591v4
- Date: Thu, 6 Jun 2024 04:16:54 GMT
- Title: Reasons to Reject? Aligning Language Models with Judgments
- Authors: Weiwen Xu, Deng Cai, Zhisong Zhang, Wai Lam, Shuming Shi
- Abstract summary: We explore the use of language feedback to align large language models (LLMs).
We propose Contrastive Unlikelihood Training (CUT) that allows for fine-grained inappropriate content detection and correction based on judgments.
Our results show CUT can beat the 175B DaVinci003 and surpass the best baseline by 50.84 points on AlpacaEval.
- Score: 72.39858230784002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As humans, we consistently interact with our peers and receive feedback in the form of natural language. This language feedback allows us to maintain appropriate behavior, and rectify potential errors. The question arises naturally: can we use language feedback to align large language models (LLMs)? In contrast to previous research that aligns LLMs with scalar rewards, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgment). We start with an in-depth investigation of potential methods that can be adapted for aligning LLMs with judgments, revealing that these methods cannot fully capitalize on judgments. To facilitate more effective utilization of judgments, we propose a novel framework, Contrastive Unlikelihood Training (CUT), that allows for fine-grained inappropriate content detection and correction based on judgments. Our results show that, with merely 1317 off-the-shelf judgment data, CUT (LLaMA2-13b) can beat the 175B DaVinci003 and surpass the best baseline by 50.84 points on AlpacaEval. CUT (LLaMA2-chat-13b) can also align LLMs in an iterative fashion using up-to-date model-specific judgments, improving performance from 81.09 to 91.68 points on AlpacaEval. Further analysis suggests that judgments hold greater potential than rewards in LLM alignment.
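To make the abstract's description of unlikelihood-based correction more concrete, below is a minimal PyTorch sketch of a token-level likelihood/unlikelihood objective of the general kind CUT builds on. The function name, the `alpha` weight, and the pre-computed `inappropriate_mask` are illustrative assumptions, not the paper's actual implementation; in particular, deciding which tokens a judgment targets is the detection step the abstract refers to, so the mask is simply taken as given here.

```python
import torch
import torch.nn.functional as F


def likelihood_unlikelihood_loss(logits, target_ids, inappropriate_mask, alpha=1.0):
    """Illustrative token-level likelihood/unlikelihood objective.

    logits:             [batch, seq_len, vocab] model outputs for the response
    target_ids:         [batch, seq_len] response token ids (int64)
    inappropriate_mask: [batch, seq_len] float mask, 1.0 where a token is judged
                        inappropriate and 0.0 elsewhere (assumed to be given)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability the model assigns to each reference token.
    tok_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

    # Standard maximum-likelihood term on tokens the judgment does not flag.
    mle_loss = -(tok_logp * (1.0 - inappropriate_mask)).sum()

    # Unlikelihood term on flagged tokens: maximise log(1 - p(token)),
    # explicitly pushing probability mass away from them.
    tok_p = tok_logp.exp().clamp(max=1.0 - 1e-6)
    unlikelihood_loss = -(torch.log1p(-tok_p) * inappropriate_mask).sum()

    return (mle_loss + alpha * unlikelihood_loss) / target_ids.numel()
```

The design choice that matters is the unlikelihood term: flagged tokens contribute -log(1 - p) rather than -log p, so the model is actively discouraged from reproducing them instead of merely not being reinforced on them (an idea going back to unlikelihood training by Welleck et al.).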
Related papers
- Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options [2.1184929769291294]
This study explores whether large language models (LLMs) prioritize following instructions over reasoning and truth when given "misleading" instructions.
We introduce a new metric called "reflective judgment", which sheds new light on the relationship between the pre-training and post-training alignment schemes.
arXiv Detail & Related papers (2024-08-27T19:27:43Z) - LLMs are Superior Feedback Providers: Bootstrapping Reasoning for Lie Detection with Self-Generated Feedback [33.14770105185958]
Large Language Models (LLMs) excel at generating human-like dialogues and comprehending text.
We propose a bootstrapping framework that leverages self-generated feedback to enhance LLM reasoning capabilities for lie detection.
We investigate the application of the proposed framework for detecting betrayal and deception in Diplomacy games, and compare it with feedback from professional human players.
arXiv Detail & Related papers (2024-08-25T18:47:55Z) - LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method for refining an LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal a language model's overall grasp of language, particularly its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion.
Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities.
We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z) - Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements (a minimal illustrative sketch of this refine-from-feedback step appears after this list).
We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback.
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
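As referenced in the ILF entry above, here is a minimal illustration of the refinement-generation step it describes: conditioning the model on the input, its previous output, and language feedback, applied over several rounds. The prompt template, the `generate` and `feedback_fn` callables, and the fixed round count are assumptions for illustration, not the ILF paper's actual implementation.

```python
from typing import Callable


def refine_with_feedback(
    generate: Callable[[str], str],           # any text-in/text-out LM interface (assumed)
    feedback_fn: Callable[[str, str], str],   # returns language feedback on an output (assumed)
    task_input: str,
    num_rounds: int = 3,
) -> str:
    """Illustrative loop: condition the LM on the input, its previous
    output, and language feedback to produce a refinement each round."""
    output = generate(f"Task: {task_input}\nAnswer:")
    for _ in range(num_rounds):
        feedback = feedback_fn(task_input, output)
        prompt = (
            f"Task: {task_input}\n"
            f"Previous answer: {output}\n"
            f"Feedback: {feedback}\n"
            f"Improved answer:"
        )
        output = generate(prompt)
    return output
```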