Alignment for Honesty
- URL: http://arxiv.org/abs/2312.07000v1
- Date: Tue, 12 Dec 2023 06:10:42 GMT
- Title: Alignment for Honesty
- Authors: Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu
- Abstract summary: We argue for the importance of alignment for honesty, ensuring that language models proactively refuse to answer questions when they lack knowledge.
This challenge demands comprehensive solutions in terms of metric development, benchmark creation, and training.
We introduce a flexible training framework which is further instantiated by several efficient fine-tuning techniques that emphasize honesty.
- Score: 113.42626737461129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has made significant strides in applying alignment techniques
to enhance the helpfulness and harmlessness of large language models (LLMs) in
accordance with human intentions. In this paper, we argue for the importance of
alignment for honesty, ensuring that LLMs proactively refuse to answer
questions when they lack knowledge, while still not being overly conservative.
However, a pivotal aspect of alignment for honesty involves discerning the
limits of an LLM's knowledge, which is far from straightforward. This challenge
demands comprehensive solutions in terms of metric development, benchmark
creation, and training methodologies. In this paper, we address these
challenges by first establishing a precise problem definition and defining
``honesty'' inspired by the Analects of Confucius. This serves as a cornerstone
for developing metrics that effectively measure an LLM's honesty by quantifying
its progress post-alignment. Furthermore, we introduce a flexible training
framework which is further instantiated by several efficient fine-tuning
techniques that emphasize honesty without sacrificing performance on other
tasks. Our extensive experiments reveal that these aligned models show a marked
increase in honesty, as indicated by our proposed metrics. We open-source a
wealth of resources to facilitate future research at
https://github.com/GAIR-NLP/alignment-for-honesty, including honesty-aligned
models, training and evaluation datasets for honesty alignment, concept
glossary, as well as all relevant source code.
Related papers
- BeHonest: Benchmarking Honesty in Large Language Models [23.192389530727713]
We introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in Large Language Models.
BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses.
Our findings indicate that there is still significant room for improvement in the honesty of LLMs.
arXiv Detail & Related papers (2024-06-19T06:46:59Z) - Out-Of-Context Prompting Boosts Fairness and Robustness in Large Language Model Predictions [17.758735680493917]
We develop test-time strategies to improve Frontier Large Language Models' trustworthiness.
We leverage causality as a tool to formally encode two aspects of trustworthiness in LLMs: fairness and robustness.
We show that out-of-context prompting consistently improves the fairness and robustness of frontier LLMs.
arXiv Detail & Related papers (2024-06-11T20:05:15Z) - Understanding the Learning Dynamics of Alignment with Human Feedback [17.420727709895736]
This paper provides an attempt to theoretically analyze the learning dynamics of human preference alignment.
We show how the distribution of preference datasets influences the rate of model updates and provide rigorous guarantees on the training accuracy.
arXiv Detail & Related papers (2024-03-27T16:39:28Z) - SaGE: Evaluating Moral Consistency in Large Language Models [15.079905222871071]
We show that even state-of-the-art Large Language Models are morally inconsistent in their generations.
We propose an information-theoretic measure called Semantic Graph Entropy (SaGE) to measure a model's moral consistency.
arXiv Detail & Related papers (2024-02-21T11:23:21Z) - Best Practices for Text Annotation with Large Language Models [11.421942894219901]
Large Language Models (LLMs) have ushered in a new era of text annotation.
This paper proposes a comprehensive set of standards and best practices for their reliable, reproducible, and ethical use.
arXiv Detail & Related papers (2024-02-05T15:43:50Z) - DeAL: Decoding-time Alignment for Large Language Models [59.63643988872571]
Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences.
We propose DeAL, a framework that allows the user to customize reward functions and enables Detime Alignment of LLMs.
Our experiments show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs.
arXiv Detail & Related papers (2024-02-05T06:12:29Z) - Enabling Language Models to Implicitly Learn Self-Improvement [49.16868302881804]
Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks.
We propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data.
arXiv Detail & Related papers (2023-10-02T04:29:40Z) - Aligning Large Language Models with Human: A Survey [53.6014921995006]
Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks.
Despite their notable performance, these models are prone to certain limitations such as misunderstanding human instructions, generating potentially biased content, or factually incorrect information.
This survey presents a comprehensive overview of these alignment technologies, including the following aspects.
arXiv Detail & Related papers (2023-07-24T17:44:58Z) - KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA)
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering $19$ tasks.
We use both Wikipedia, a corpus prevalently pre-trained by LLMs, along with continuously collected emerging corpora, to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z) - Context-faithful Prompting for Large Language Models [51.194410884263135]
Large language models (LLMs) encode parametric knowledge about world facts.
Their reliance on parametric knowledge may cause them to overlook contextual cues, leading to incorrect predictions in context-sensitive NLP tasks.
We assess and enhance LLMs' contextual faithfulness in two aspects: knowledge conflict and prediction with abstention.
arXiv Detail & Related papers (2023-03-20T17:54:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.