Teaching Models to Balance Resisting and Accepting Persuasion
- URL: http://arxiv.org/abs/2410.14596v2
- Date: Mon, 10 Feb 2025 14:09:46 GMT
- Title: Teaching Models to Balance Resisting and Accepting Persuasion
- Authors: Elias Stengel-Eskin, Peter Hase, Mohit Bansal
- Abstract summary: We show that Persuasion-Balanced Training (or PBT) can balance positive and negative persuasion.
PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models.
We find that PBT leads to better and more stable results and less order dependence.
- Score: 69.68379406317682
- License:
- Abstract: Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models. Moreover, PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates across two domains (trivia and commonsense QA). We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.
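To make the mechanism described in the abstract concrete, the sketch below shows one way persuasion-balanced preference pairs could be derived from a dialogue branch and fed to a standard DPO objective. The labelling rule, function names, and the dpo_loss helper are illustrative assumptions for exposition, not the authors' released implementation.

```python
# A minimal, illustrative sketch of persuasion-balanced preference data:
# the labelling rule, function names, and dpo_loss helper are assumptions,
# not the authors' released code.
import torch.nn.functional as F

def label_preference(initial_answer, persuaded_answer, gold_answer):
    """Label one dialogue branch: keep the original answer ("resist") or
    adopt the interlocutor's answer ("accept"), preferring whichever is correct."""
    resist, accept = initial_answer, persuaded_answer
    if initial_answer == gold_answer and persuaded_answer != gold_answer:
        return {"chosen": resist, "rejected": accept}   # resist negative persuasion
    if initial_answer != gold_answer and persuaded_answer == gold_answer:
        return {"chosen": accept, "rejected": resist}   # accept positive persuasion
    return None  # ambiguous branch: no preference pair

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective over (chosen, rejected) sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Example: the interlocutor pushes a wrong answer, so resisting is preferred.
print(label_preference("Paris", "Lyon", gold_answer="Paris"))
```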
Related papers
- Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving weak human supervision with a strong pretrained model and then supervising the strong model with that enhanced weak supervision.
We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model.
Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combined approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
- Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks [0.0]
Chain-of-thought prompting, self-verification, and multi-agent debate have been proposed to improve the reasoning and factual accuracy of large language models.
We find that multi-agent debate helps at any model scale, and that diversity of thought elicits stronger reasoning in debating LLMs.
arXiv Detail & Related papers (2024-10-10T21:59:01Z)
- ProFuser: Progressive Fusion of Large Language Models [53.697927989207045]
We introduce a novel approach that enhances the fusion process by incorporating both the training and inference modes.
Our method evaluates model advantage not only through cross entropy during training but also by considering inference outputs.
To validate ProFuser's effectiveness, we fused three models, including vicuna-7b-v1.5, Llama-2-7b-chat, and mpt-7b-8k-chat.
arXiv Detail & Related papers (2024-08-09T11:18:29Z)
- Debating with More Persuasive LLMs Leads to More Truthful Answers [45.0343254517401]
We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively.
Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
arXiv Detail & Related papers (2024-02-09T21:05:01Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
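One way to read the contrastive reward-modelling objective summarized above is sketched below: a shared backbone scores chosen and rejected responses, and a margin-based pairwise loss pushes their rewards apart. The RewardModel class, the pooling choice, and the margin term are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of a contrastive / pairwise reward-modelling objective.
# Class names, pooling, and the margin term are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder                      # e.g. a Hugging Face-style transformer
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # assumes the backbone returns .last_hidden_state; pool the final token
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.value_head(h[:, -1]).squeeze(-1)   # one scalar reward per sequence

def preference_loss(r_chosen, r_rejected, margin=0.0):
    """Bradley-Terry style loss; the margin separates chosen from rejected rewards."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Shape check with dummy rewards (no backbone needed for the loss itself).
print(preference_loss(torch.randn(8), torch.randn(8), margin=0.1))
```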
- Mutual Adversarial Training: Learning together is better than going alone [82.78852509965547]
We study how interactions among models affect robustness via knowledge distillation.
We propose mutual adversarial training (MAT) in which multiple models are trained together.
MAT can effectively improve model robustness and outperform state-of-the-art methods under white-box attacks.
arXiv Detail & Related papers (2021-12-09T15:59:42Z)
- Deep Repulsive Prototypes for Adversarial Robustness [3.351714665243138]
We propose to train models on output spaces with large class separation in order to gain robustness without adversarial training.
We introduce a method to partition the output space into class prototypes with large separation and train models to preserve it.
Experimental results show that models trained with these prototypes gain robustness competitive with adversarial training.
arXiv Detail & Related papers (2021-05-26T09:30:28Z)
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2-based models, on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline by a large margin on predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z)
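The ranking setup in the entry above can be read as a pairwise objective over a GPT-2 scorer; the ResponseRanker class and ranking_loss below are a hedged sketch under that reading, not the released DialogRPT code. In training, each feedback pair would supply a higher- and a lower-feedback response for the same context, and ranking_loss is applied to their two scores.

```python
# Hedged sketch of a pairwise feedback ranker with a GPT-2 backbone.
# Class and function names are illustrative, not the released DialogRPT code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2Model

class ResponseRanker(nn.Module):
    def __init__(self, name="gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(name)
        self.score_head = nn.Linear(self.backbone.config.n_embd, 1)

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last = attention_mask.long().sum(dim=1) - 1       # index of final real token
        pooled = h[torch.arange(h.size(0)), last]
        return self.score_head(pooled).squeeze(-1)        # scalar score per (context, response)

def ranking_loss(score_pos, score_neg):
    """Encourage the higher-feedback response to score above the lower one."""
    return -F.logsigmoid(score_pos - score_neg).mean()
```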
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.