Related papers: Black-Box Prompt Optimization: Aligning Large Language Models without Model Training

Black-Box Prompt Optimization: Aligning Large Language Models without Model Training

URL: http://arxiv.org/abs/2311.04155v3
Date: Fri, 21 Jun 2024 06:06:07 GMT
Title: Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
Authors: Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang,
Abstract summary: Large language models (LLMs) have shown impressive success in various applications. LLMs are often not well aligned with human intents, which calls for additional treatments on them. In this work, we take a different perspective -- Black-Box Prompt Optimization (BPO) -- to perform alignments.
Score: 95.73262836039231
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have shown impressive success in various applications. However, these models are often not well aligned with human intents, which calls for additional treatments on them; that is, the alignment problem. To make LLMs better follow user instructions, existing alignment methods primarily focus on further training them. However, the extra training of LLMs is usually expensive in terms of GPU computing; even worse, some LLMs are not accessible for user-demanded training, such as GPTs. In this work, we take a different perspective -- Black-Box Prompt Optimization (BPO) -- to perform alignments. The idea is to optimize user prompts to suit LLMs' input understanding, so as to best realize users' intents without updating LLMs' parameters. BPO leverages human preferences to optimize prompts, thus making it superior to LLM (e.g., ChatGPT) as a prompt engineer. Moreover, BPO is model-agnostic, and the empirical results demonstrate that the BPO-aligned ChatGPT yields a 22% increase in the win rate against its original version and 10% for GPT-4. Notably, the BPO-aligned LLMs can outperform the same models aligned by PPO and DPO, and it also brings additional performance gains when combining BPO with PPO or DPO. Code and datasets are released at https://github.com/thu-coai/BPO.

Related papers

Rethinking Prompt Optimizers: From Prompt Merits to Optimization [14.01541576309104]
We introduce MePO, a merit-guided, lightweight, locally deployable prompt training dataset built from merit-aligned prompts.<n>MePO avoids online optimization, reduces cost and privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models.
arXiv Detail & Related papers (2025-05-15T03:31:37Z)
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models [50.16340812031201]
We show that large language models (LLMs) do not update their beliefs as expected from the Bayesian framework. We teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model.
arXiv Detail & Related papers (2025-03-21T20:13:04Z)
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback [40.01227095901647]
Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining.<n>We introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference.<n>Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly.
arXiv Detail & Related papers (2025-01-22T14:15:46Z)
FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings [40.605411087380226]
We introduce FocalPO, a DPO variant that prioritizes enhancing the model's understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss.
arXiv Detail & Related papers (2025-01-11T21:41:27Z)
Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning [18.763247227949822]
Large language models (LLMs) excel in natural language generation using zero-shot and few-shot prompting. encoder-only models like BERT-base outperform LLMs on benchmarks like GLUE and SuperGLUE. This paper explores two approaches-supervised fine-tuning (SFT) and proximal policy optimization (PPO)-to enhance LLMs' NLU abilities.
arXiv Detail & Related papers (2024-10-14T19:16:56Z)
Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preference is a paradigm used in large-scale language model (LLM) fine-tuning step to better align pretrained LLM to human preference for downstream task. Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method. In this article, we analyze the working mechanism of $beta$ in DPO, disclose its syntax difference between RL algorithm and DPO, and understand the potential shortage brought by the DPO simplification.
arXiv Detail & Related papers (2024-08-19T09:29:31Z)
Bootstrapping Language Models with DPO Implicit Rewards [45.68366127605774]
Direct preference optimization (DPO) has greatly simplified the process from past work in reinforcement learning from human feedback. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves superior performance.
arXiv Detail & Related papers (2024-06-14T06:57:18Z)
Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models. Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost models' alignment with human preference. We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models. We shed light on the essence of ExPO amplifying the reward signal learned during alignment training.
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study [16.99550556866219]
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. In academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO) We show that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.
arXiv Detail & Related papers (2024-04-16T16:51:53Z)
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts. RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
LiPO: Listwise Preference Optimization through Learning-to-Rank [62.02782819559389]
Policy can learn more effectively from a ranked list of plausible responses given the prompt. We show that LiPO-$lambda$ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks.
arXiv Detail & Related papers (2024-02-02T20:08:10Z)
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting [65.00288634420812]
Pairwise Ranking Prompting (PRP) is a technique to significantly reduce the burden on Large Language Models (LLMs) Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs.
arXiv Detail & Related papers (2023-06-30T11:32:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.