Dishonesty in Helpful and Harmless Alignment
- URL: http://arxiv.org/abs/2406.01931v2
- Date: Wed, 5 Jun 2024 07:21:19 GMT
- Title: Dishonesty in Helpful and Harmless Alignment
- Authors: Youcheng Huang, Jingkun Tang, Duanyu Feng, Zheng Zhang, Wenqiang Lei, Jiancheng Lv, Anthony G. Cohn
- Abstract summary: Large language models (LLMs) are aligned to human values with reinforcement learning, where they receive rewards if they satisfy human preferences.
We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies when generating harmless responses.
- Score: 26.123327022999124
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: People tell lies when seeking rewards. Large language models (LLMs) are aligned to human values with reinforcement learning, where they receive rewards if they satisfy human preferences. We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies when generating harmless responses. Using the latest interpreting tools, we detect this dishonesty, show how LLMs can become harmful if their honesty is increased, and analyze such conflicts at the parameter level. Given these preliminaries and the hypothesis that reward-seeking stimulates dishonesty, we theoretically show that dishonesty can in turn decrease alignment performance, and we augment reward-seeking alignment with representation regularization. Extensive results, including GPT-4-annotated win rates, perplexities, and case studies, demonstrate that we can train more honest, helpful, and harmless LLMs. We will open-source all our code and results upon this paper's acceptance.
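The abstract's "representation regularization" is not specified in this listing, so the following is only a minimal sketch, assuming a DPO-style preference loss augmented with a penalty that keeps hidden states at a chosen layer close to those of a frozen reference model. The names `policy_hidden`, `ref_hidden`, `reg_weight`, and the use of an MSE penalty are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: a preference-optimization loss (DPO-style) augmented
# with a representation-regularization term. Illustrative only; not the
# paper's released code.
import torch
import torch.nn.functional as F

def preference_loss_with_representation_reg(
    policy_chosen_logps: torch.Tensor,    # log-probs of chosen responses under the policy
    policy_rejected_logps: torch.Tensor,  # log-probs of rejected responses under the policy
    ref_chosen_logps: torch.Tensor,       # same, under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    policy_hidden: torch.Tensor,          # hidden states at an assumed "honesty" layer (policy)
    ref_hidden: torch.Tensor,             # hidden states at the same layer (reference model)
    beta: float = 0.1,                    # preference temperature (assumed value)
    reg_weight: float = 0.05,             # strength of the representation penalty (assumed value)
) -> torch.Tensor:
    # Standard DPO-style objective: widen the reward margin between chosen and rejected.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    pref_loss = -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

    # Representation regularizer: penalize drift of the policy's hidden states
    # away from the reference model's, intended to preserve honesty-related
    # directions while still optimizing for the preference reward.
    rep_loss = F.mse_loss(policy_hidden, ref_hidden.detach())

    return pref_loss + reg_weight * rep_loss
```

In use, the hidden states would come from the same intermediate layer of the policy and reference models on the same inputs, and `reg_weight` trades off reward-seeking against representation drift.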
Related papers
- Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models [57.834711966432685]
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value.
We introduce the Bullshit Index, a novel metric quantifying large language models' indifference to truth.
We observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy.
arXiv Detail & Related papers (2025-07-10T07:11:57Z) - But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors [0.0]
We introduce a new framework, Judge Using Safety-Steered Alternatives (JUSSA), which utilizes steering vectors trained on a single sample to elicit more honest responses from models.
We find that JUSSA enables LLM judges to better differentiate between dishonest and benign responses, and helps them identify subtle instances of manipulative behavior.
arXiv Detail & Related papers (2025-05-23T11:34:02Z) - The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems [22.458311369795112]
We introduce a large-scale human-collected dataset for measuring honesty directly.
We find that while larger models obtain higher accuracy on our benchmark, they do not become more honest.
We find that simple methods, such as representation engineering interventions, can improve honesty.
arXiv Detail & Related papers (2025-03-05T18:59:23Z) - Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning [79.48839334040197]
Instruction Fine-tuning (IFT) can enhance the helpfulness of Large Language Models (LLMs).
IFT steers LLMs to generate responses with long-tail knowledge that is not well covered during pre-training, leading to more informative but less truthful answers when generalizing to unseen tasks.
We propose UNIT, a novel IFT paradigm to address this trade-off.
arXiv Detail & Related papers (2025-02-17T16:10:30Z) - Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks.
This vulnerability poses significant risks to real-world applications.
We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z) - Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Advances in large language models (LLMs) have led to their integration into peer review.
The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.
We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z) - Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task? [1.4936946857731093]
We introduce SCALPEL, a technique to incrementally modify stimuli to test different hypotheses about why LLMs fail.
Our results suggest that LLMs often do poorly because they fail to make essential common-sense inferences.
We conclude that while modern LLMs go beyond mere pattern matching, they still fall short of robust human-like Theory of Mind (ToM).
arXiv Detail & Related papers (2024-06-20T21:02:30Z) - BeHonest: Benchmarking Honesty in Large Language Models [23.192389530727713]
We introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in Large Language Models.
BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses.
Our findings indicate that there is still significant room for improvement in the honesty of LLMs.
arXiv Detail & Related papers (2024-06-19T06:46:59Z) - "I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust [51.542856739181474]
We show how different natural language expressions of uncertainty impact participants' reliance, trust, and overall task performance.
We find that first-person expressions decrease participants' confidence in the system and their tendency to agree with its answers, while increasing participants' accuracy.
Our findings suggest that using natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters.
arXiv Detail & Related papers (2024-05-01T16:43:55Z) - Is Factuality Enhancement a Free Lunch For LLMs? Better Factuality Can Lead to Worse Context-Faithfulness [39.74642729786543]
We argue that current factuality enhancement methods can significantly undermine the context-faithfulness of large language models (LLMs).
Experiments reveal that while these methods may yield inconsistent improvements in factual accuracy, they also cause a more severe decline in context-faithfulness.
arXiv Detail & Related papers (2024-03-30T02:08:28Z) - Exploring Value Biases: How LLMs Deviate Towards the Ideal [57.99044181599786]
Large language models (LLMs) are deployed in a wide range of applications, and their responses have an increasing social impact.
We show that value bias is strong in LLMs across different categories, similar to the results found in human studies.
arXiv Detail & Related papers (2024-02-16T18:28:43Z) - How do Large Language Models Navigate Conflicts between Honesty and Helpfulness? [14.706111954807021]
We use psychological models and experiments designed to characterize human behavior to analyze large language models.
We find that reinforcement learning from human feedback improves both honesty and helpfulness.
GPT-4 Turbo demonstrates human-like response patterns, including sensitivity to conversational framing and the listener's decision context.
arXiv Detail & Related papers (2024-02-11T19:13:26Z) - Alignment for Honesty [105.72465407518325]
Recent research has made significant strides in aligning large language models (LLMs) with helpfulness and harmlessness.
In this paper, we argue for the importance of alignment for honesty, ensuring that LLMs proactively refuse to answer questions when they lack knowledge.
We address these challenges by first establishing a precise problem definition and defining "honesty" inspired by the Analects of Confucius.
arXiv Detail & Related papers (2023-12-12T06:10:42Z) - Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching [0.0]
Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs are due to a lack of knowledge or dishonesty.
In this paper, we investigate instructed dishonesty, wherein we explicitly prompt LLaMA-2-70b-chat to lie.
We perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs.
arXiv Detail & Related papers (2023-11-25T22:41:23Z) - Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward favorable outcomes by utilizing human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z) - "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters [97.11173801187816]
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content.
This paper critically examines gender biases in LLM-generated reference letters.
arXiv Detail & Related papers (2023-10-13T16:12:57Z) - Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.