Toward Optimal LLM Alignments Using Two-Player Games
- URL: http://arxiv.org/abs/2406.10977v1
- Date: Sun, 16 Jun 2024 15:24:50 GMT
- Title: Toward Optimal LLM Alignments Using Two-Player Games
- Authors: Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, Xiaojun Xu, Zhaoran Wang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Hang Li, Yang Liu
- Abstract summary: In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent.
We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents.
Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
- Score: 86.39338084862324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The standard Reinforcement Learning from Human Feedback (RLHF) framework primarily focuses on optimizing the performance of large language models using pre-collected prompts. However, collecting prompts that provide comprehensive coverage is both tedious and challenging, and often fails to include scenarios that LLMs need to improve on the most. In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. The adversarial agent's task at each step is to generate prompts that expose the weakness of the defensive agent. In return, the defensive agent seeks to improve its responses to these newly identified prompts it struggled with, based on feedback from the reward model. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents. Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
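The alternating optimization the abstract describes can be sketched as a toy loop. Everything below (the one-dimensional "policies", the reward model, the candidate-sampling rule, and the update rule) is an illustrative stand-in, not the paper's actual method:

```python
import random

def reward_model(prompt, response):
    # Toy stand-in for a learned safety reward model: higher is safer.
    # (Hypothetical scoring rule, for illustration only.)
    return -abs(prompt - response)

def train_two_player(steps=50, seed=0):
    """Alternating updates between an adversarial prompt generator and a
    defensive responder, mirroring the iterative game in the abstract."""
    rng = random.Random(seed)
    attacker, defender = 0.0, 0.0  # 1-D "policies" for illustration
    lr = 0.5
    for _ in range(steps):
        # Adversary step: propose candidate prompts and keep the one that
        # most lowers the defender's reward (i.e., exposes a weakness).
        candidates = [attacker + rng.uniform(-1, 1) for _ in range(8)]
        attacker = min(candidates, key=lambda p: reward_model(p, defender))
        # Defender step: improve the response on the newly identified
        # adversarial prompt, guided by the reward model's feedback.
        defender += lr * (attacker - defender)
    return attacker, defender
```

In this caricature the adversary keeps pushing toward inputs the defender handles worst, while the defender closes the gap at each round; the paper's theoretical claim is that the analogous RL procedure converges to a Nash Equilibrium of the induced game.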
Related papers
- MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety [28.246225272659917]
This paper introduces MAGIC, a novel multi-turn multi-agent reinforcement learning framework.
It formulates Large Language Model safety alignment as an adversarial asymmetric game.
The framework demonstrates superior defense success rates without compromising the helpfulness of the model.
arXiv Detail & Related papers (2026-02-02T02:12:28Z)
- Safety Alignment of LMs via Non-cooperative Games [51.83432183158595]
Current approaches rely on sequential adversarial training, generating adversarial prompts and fine-tuning LMs to defend against them.
We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly.
Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking.
arXiv Detail & Related papers (2025-12-23T22:13:14Z)
- Adversarial Reinforcement Learning for Large Language Model Agent Safety [20.704989548285372]
Large Language Model (LLM) agents can leverage tools like Google Search to complete complex tasks.
Current defense strategies rely on fine-tuning LLM agents on datasets of known attacks.
We propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two-player zero-sum game.
arXiv Detail & Related papers (2025-10-06T23:09:18Z)
- Searching for Privacy Risks in LLM Agents via Simulation [61.229785851581504]
We present a search-based framework that alternates between improving attack and defense strategies through the simulation of privacy-critical agent interactions.
We find that attack strategies escalate from direct requests to sophisticated tactics, such as impersonation and consent forgery.
The discovered attacks and defenses transfer across diverse scenarios and backbone models, demonstrating strong practical utility for building privacy-aware agents.
arXiv Detail & Related papers (2025-08-14T17:49:09Z)
- Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models [55.28518567702213]
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities.
This sequential approach creates a mismatch: attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats.
We propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction.
arXiv Detail & Related papers (2025-06-09T06:35:12Z)
- AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.
We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.
We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z)
- A Dual-Agent Adversarial Framework for Robust Generalization in Deep Reinforcement Learning [7.923577336744156]
We propose a dual-agent adversarial policy learning framework.
This framework allows agents to spontaneously learn the underlying semantics without introducing any human prior knowledge.
Experiments show that the adversarial process significantly improves the generalization performance of both agents.
arXiv Detail & Related papers (2025-01-29T02:36:47Z)
- From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.
We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
- Mitigating Adversarial Perturbations for Deep Reinforcement Learning via Vector Quantization [18.56608399174564]
Well-performing reinforcement learning (RL) agents often lack resilience against adversarial perturbations during deployment.
This highlights the importance of building a robust agent before deploying it in the real world.
In this work, we study an input transformation-based defense for RL.
arXiv Detail & Related papers (2024-10-04T12:41:54Z)
- Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs [13.317364896194903]
We propose a two-stage adversarial tuning framework to enhance Large Language Models' generalized defense capabilities.
In the first stage, we introduce the hierarchical meta-universal adversarial prompt learning to efficiently generate token-level adversarial prompts.
In the second stage, we propose the automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts.
arXiv Detail & Related papers (2024-06-07T15:37:15Z)
- Large Language Model Sentinel: LLM Agent for Adversarial Purification [27.461127931996323]
Large language models (LLMs) are vulnerable to adversarial attacks by some well-designed textual perturbations.
We introduce a novel defense technique named Large LAnguage MOdel Sentinel (LLAMOS) to enhance the adversarial robustness of LLMs.
arXiv Detail & Related papers (2024-05-24T07:23:56Z)
- Best Response Shaping [1.0874100424278175]
LOLA and POLA agents learn reciprocity-based cooperative policies by differentiation through a few look-ahead optimization steps of their opponent.
Because they consider a few optimization steps, a learning opponent that takes many steps to optimize its return may exploit them.
In response, we introduce a novel approach, Best Response Shaping (BRS), which differentiates through an opponent that approximates the best response.
arXiv Detail & Related papers (2024-04-05T22:03:35Z)
- Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization [53.510942601223626]
Large Language Models (LLMs) exhibit robust problem-solving capabilities for diverse tasks.
These task solvers necessitate manually crafted prompts to inform task rules and regulate behaviors.
We propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization.
arXiv Detail & Related papers (2024-02-27T15:09:20Z)
- Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards.
We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences.
We show that our method leads to the emergence of complex skills by exhibiting clear phase transitions.
arXiv Detail & Related papers (2021-07-12T17:58:40Z)
- Robust Reinforcement Learning on State Observations with Learned Optimal Adversary [86.0846119254031]
We study the robustness of reinforcement learning with adversarially perturbed state observations.
With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found.
For DRL settings, this leads to a novel empirical adversarial attack to RL agents via a learned adversary that is much stronger than previous ones.
arXiv Detail & Related papers (2021-01-21T05:38:52Z)
- Robust Reinforcement Learning using Adversarial Populations [118.73193330231163]
Reinforcement Learning (RL) is an effective tool for controller design but can struggle with issues of robustness.
We show that using a single adversary does not consistently yield robustness to dynamics variations under standard parametrizations of the adversary.
We propose a population-based augmentation to the Robust RL formulation in which we randomly initialize a population of adversaries and sample from the population uniformly during training.
arXiv Detail & Related papers (2020-08-04T20:57:32Z)
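The population-based recipe in the last entry (initialize many adversaries, sample one uniformly per training step) can be sketched as a minimal loop. All quantities below (the scalar "policy", the perturbation-valued adversaries, and the tracking update) are illustrative stand-ins, not that paper's implementation:

```python
import random

def train_against_population(iters=200, n_adversaries=6, lr=0.1, seed=0):
    """Sketch of population-based Robust RL: rather than training against
    a single fixed adversary, sample uniformly from a randomly initialized
    population of adversaries at every protagonist update."""
    rng = random.Random(seed)
    # Each adversary applies a fixed perturbation to the protagonist's
    # observation of the target (a stand-in for dynamics variation).
    population = [rng.uniform(-1.0, 1.0) for _ in range(n_adversaries)]
    target, policy = 5.0, 0.0
    for _ in range(iters):
        perturb = rng.choice(population)    # uniform sampling, as proposed
        observed = target + perturb         # adversarially perturbed input
        policy += lr * (observed - policy)  # simple tracking update
    return policy
```

Because every update sees a different adversary, the learned policy ends up hedging against the whole population instead of overfitting to any single perturbation, which is the intuition behind the reported robustness gains.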
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.