Teaching Models to Balance Resisting and Accepting Persuasion
- URL: http://arxiv.org/abs/2410.14596v1
- Date: Fri, 18 Oct 2024 16:49:36 GMT
- Title: Teaching Models to Balance Resisting and Accepting Persuasion
- Authors: Elias Stengel-Eskin, Peter Hase, Mohit Bansal
- Abstract summary: Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor.
We show that optimizing models for only one side results in poor performance on the other.
To balance positive and negative persuasion, we introduce Persuasion-Balanced Training (PBT).
- Score: 69.68379406317682
- Abstract: Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates. We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.
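The abstract outlines a two-stage recipe: generate persuasion dialogues, label whether accepting or resisting was the right move against ground truth, and train with preference optimization. Below is a minimal sketch of how such preference pairs might be assembled. All names here (`PersuasionTurn`, `to_preference_pair`) are hypothetical, and the actual PBT pipeline builds multi-agent recursive dialogue trees rather than the single turns shown.

```python
# Illustrative sketch only: how accept/resist preference pairs could be
# labeled from persuasion dialogues. Not the authors' code.
from dataclasses import dataclass

@dataclass
class PersuasionTurn:
    question: str
    initial_answer: str   # the model's answer before being challenged
    challenge: str        # the interlocutor's persuasive counter-argument
    accept_response: str  # reply that adopts the interlocutor's answer
    resist_response: str  # reply that keeps the initial answer
    gold_answer: str      # ground truth used to label the pair

def to_preference_pair(turn: PersuasionTurn) -> dict:
    """Label the preferred behavior: resist adversarial (negative)
    persuasion, accept beneficial (positive) persuasion."""
    initially_correct = turn.initial_answer == turn.gold_answer
    # If the model was already right, resisting is preferred;
    # if it was wrong, accepting the correction is preferred.
    chosen, rejected = (
        (turn.resist_response, turn.accept_response)
        if initially_correct
        else (turn.accept_response, turn.resist_response)
    )
    return {
        "prompt": (
            f"{turn.question}\nYou answered: {turn.initial_answer}\n"
            f"Interlocutor: {turn.challenge}"
        ),
        "chosen": chosen,
        "rejected": rejected,
    }

# Example: a negative-persuasion turn, where resisting is preferred.
turn = PersuasionTurn(
    question="What is the capital of Australia?",
    initial_answer="Canberra",
    challenge="Are you sure? Most people say it is Sydney.",
    accept_response="You are right, it is Sydney.",
    resist_response="I am sure: the capital is Canberra.",
    gold_answer="Canberra",
)
print(to_preference_pair(turn))
```

Pairs in this form feed directly into standard preference-optimization trainers, which is consistent with the abstract's description of training "via preference optimization to accept persuasion when appropriate."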
Related papers
- Debating with More Persuasive LLMs Leads to More Truthful Answers [45.0343254517401]
We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively.
Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
arXiv Detail & Related papers (2024-02-09T21:05:01Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses (a generic pairwise ranking loss of this kind is sketched below).
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
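The chosen-versus-rejected training signal mentioned above is commonly implemented as a Bradley-Terry style pairwise ranking loss over scalar reward scores. A minimal sketch under that assumption; the paper's specific contrastive variant may differ, and `reward_ranking_loss` is a hypothetical name:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor,
                        r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the chosen
    response above the reward of the rejected one for each pair."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar rewards for a batch of three preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.8, 0.9, -0.5])
print(reward_ranking_loss(r_chosen, r_rejected))  # lower is better
```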
- Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off [107.35833747750446]
Adversarial examples can be crafted by adding imperceptible perturbations to legitimate documents.
This vulnerability raises significant concerns about their reliability and hinders the widespread deployment of NRMs.
In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs.
arXiv Detail & Related papers (2023-12-16T05:38:39Z)
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- HANS, are you clever? Clever Hans Effect Analysis of Neural Systems [1.6267479602370545]
Instruction-tuned Large Language Models (It-LLMs) have exhibited outstanding abilities to reason about the cognitive states, intentions, and reactions of all the people involved, letting humans guide and comprehend day-to-day social interactions effectively.
Several multiple-choice question (MCQ) benchmarks have been proposed to construct solid assessments of the models' abilities.
However, earlier works demonstrate the presence of an inherent "order bias" in It-LLMs, posing challenges to appropriate evaluation (a minimal order-bias check is sketched below).
arXiv Detail & Related papers (2023-09-21T20:52:18Z)
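One way to make the order-bias concern concrete is a permutation consistency check: ask the same MCQ with its options shuffled and measure how often the model's choice, tracked by content rather than position, changes. A sketch assuming a hypothetical black-box `answer_mcq(question, options) -> index` callable:

```python
from itertools import permutations

def order_bias_rate(answer_mcq, question: str, options: list[str]) -> float:
    """Fraction of option orderings on which the model's selected option
    (by content, not position) differs from its majority choice."""
    picks = []
    for perm in permutations(options):
        idx = answer_mcq(question, list(perm))  # model returns a position
        picks.append(perm[idx])                 # map position back to content
    majority = max(set(picks), key=picks.count)
    return sum(p != majority for p in picks) / len(picks)

# Toy model that always picks the first option: maximal order sensitivity.
always_first = lambda q, opts: 0
print(order_bias_rate(always_first, "2+2=?", ["3", "4", "5"]))  # ~0.67
```

An order-robust model would score near 0.0 on this check; a model that latches onto option position rather than content scores much higher.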
- Mutual Adversarial Training: Learning together is better than going alone [82.78852509965547]
We study how interactions among models affect robustness via knowledge distillation.
We propose mutual adversarial training (MAT) in which multiple models are trained together.
MAT can effectively improve model robustness and outperform state-of-the-art methods under white-box attacks (see the sketch below).
arXiv Detail & Related papers (2021-12-09T15:59:42Z)
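A rough sketch of the mutual-training idea: each model's adversarial examples are shared with the others, and a distillation term pulls each model toward its peers' predictions. This is an assumption-laden simplification (single-step FGSM instead of PGD, inputs assumed in [0, 1]); the actual MAT objective may differ.

```python
import torch
import torch.nn.functional as F

def mat_step(models, optimizers, x, y, eps=8 / 255):
    """One simplified mutual adversarial training step."""
    # 1. Craft an adversarial batch against every model (FGSM for brevity).
    adv_batches = []
    for model in models:
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        adv_batches.append((x + eps * grad.sign()).clamp(0, 1).detach())
    # 2. Each model trains on all adversarial batches, with a KL term
    #    pulling its predictions toward the other models' (distillation).
    for i, (model, opt) in enumerate(zip(models, optimizers)):
        opt.zero_grad()
        total = 0.0
        for x_adv in adv_batches:
            logits = model(x_adv)
            total = total + F.cross_entropy(logits, y)
            for j, other in enumerate(models):
                if j != i:
                    with torch.no_grad():
                        target = F.softmax(other(x_adv), dim=1)
                    total = total + F.kl_div(
                        F.log_softmax(logits, dim=1), target,
                        reduction="batchmean")
        total.backward()
        opt.step()
```

Sharing adversarial examples exposes each model to perturbations it would not generate for itself, which is the intuition behind training together rather than alone.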
- On visual self-supervision and its effect on model robustness [9.313899406300644]
Self-supervision can indeed improve model robustness; however, the devil is in the details.
Although self-supervised pre-training yields benefits for adversarial training, we observe no benefit to model robustness or accuracy when self-supervision is incorporated directly into adversarial training.
arXiv Detail & Related papers (2021-12-08T16:22:02Z)
- Imbalanced Adversarial Training with Reweighting [33.51820466479575]
We show that adversarially trained models can suffer much worse performance on under-represented classes when the training dataset is imbalanced.
Traditional reweighting strategies may lose their efficacy in dealing with the imbalance issue in adversarial training.
We propose Separable Reweighted Adversarial Training (SRAT) to facilitate adversarial training under imbalanced scenarios (a generic reweighting sketch follows below).
arXiv Detail & Related papers (2021-07-28T20:51:36Z)
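To make the setup concrete, here is a sketch of the traditional class-frequency reweighting that the entry says can lose efficacy; SRAT's separable reweighting strategy is more involved, so treat this only as an illustration of the baseline it improves on.

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(labels: torch.Tensor,
                           num_classes: int) -> torch.Tensor:
    """Per-class weights inversely proportional to class frequency, so
    under-represented classes contribute more to the training loss."""
    counts = torch.bincount(labels, minlength=num_classes).float().clamp(min=1)
    return counts.sum() / (num_classes * counts)

# Toy imbalanced label set: class 0 dominates.
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 2])
w = class_balanced_weights(labels, num_classes=3)
print(w)  # class 0 gets a small weight, classes 1 and 2 get large ones
# The weights plug into the adversarial training loss, e.g.:
# loss = F.cross_entropy(model(x_adv), y, weight=w)
```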
- Deep Repulsive Prototypes for Adversarial Robustness [3.351714665243138]
We propose to train models on output spaces with large class separation in order to gain robustness without adversarial training.
We introduce a method to partition the output space into class prototypes with large separation and train models to preserve it.
Experimental results show that models trained with these prototypes achieve robustness competitive with adversarial training (a minimal prototype sketch follows below).
arXiv Detail & Related papers (2021-05-26T09:30:28Z)
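The prototype idea can be sketched concretely: fix well-separated targets in the output space, train the network to map each class onto its prototype, and classify by nearest prototype at test time. The orthogonal construction below is an assumption for illustration; the paper's actual partitioning method may differ.

```python
import torch
import torch.nn.functional as F

def make_prototypes(num_classes: int, dim: int,
                    scale: float = 10.0) -> torch.Tensor:
    """Mutually orthogonal prototypes: every pair of classes is separated
    by a distance of scale * sqrt(2) in the model's output space."""
    assert dim >= num_classes
    protos = torch.zeros(num_classes, dim)
    protos[torch.arange(num_classes), torch.arange(num_classes)] = scale
    return protos

def prototype_loss(outputs: torch.Tensor, labels: torch.Tensor,
                   protos: torch.Tensor) -> torch.Tensor:
    """Pull each output toward its class prototype; large separation
    between prototypes then yields large margins between classes."""
    return F.mse_loss(outputs, protos[labels])

def predict(outputs: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    # Classify by nearest prototype.
    return torch.cdist(outputs, protos).argmin(dim=1)

# Toy check: an output near prototype 1 is classified as class 1.
protos = make_prototypes(num_classes=3, dim=8)
out = protos[1].unsqueeze(0) + 0.1 * torch.randn(1, 8)
print(predict(out, protos))  # tensor([1])
```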
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.