Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning
- URL: http://arxiv.org/abs/2310.11053v3
- Date: Mon, 4 Mar 2024 07:14:10 GMT
- Title: Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning
- Authors: Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, Ning Gu
- Abstract summary: Large Language Models (LLMs) have made unprecedented breakthroughs, yet their integration into everyday life might raise societal risks due to generated unethical content.
This work delves into ethical values utilizing Moral Foundation Theory.
- Score: 36.66806788879868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have made unprecedented breakthroughs, yet their
increasing integration into everyday life might raise societal risks due to
generated unethical content. Despite extensive study on specific issues like
bias, the intrinsic values of LLMs remain largely unexplored from a moral
philosophy perspective. This work delves into ethical values utilizing Moral
Foundation Theory. Moving beyond conventional discriminative evaluations with
poor reliability, we propose DeNEVIL, a novel prompt generation algorithm
tailored to dynamically exploit LLMs' value vulnerabilities and elicit the
violation of ethics in a generative manner, revealing their underlying value
inclinations. On such a basis, we construct MoralPrompt, a high-quality dataset
comprising 2,397 prompts covering 500+ value principles, and then benchmark the
intrinsic values across a spectrum of LLMs. We discovered that most models are
essentially misaligned, necessitating further ethical value alignment. In
response, we develop VILMO, an in-context alignment method that substantially
enhances the value compliance of LLM outputs by learning to generate
appropriate value instructions, outperforming existing competitors. Our methods
are suitable for black-box and open-source models, offering a promising initial
step in studying the ethical values of LLMs.
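The abstract describes two moving parts: DeNEVIL, which generatively searches for prompts that lead a model to violate a given value principle, and VILMO, which improves value compliance by prepending learned value instructions at inference time. The authors' implementation is not reproduced here; the following Python sketch only illustrates that probe-then-instruct loop under assumed interfaces, where `generate` and `violates_principle` are hypothetical stubs rather than the paper's components.

```python
# Illustrative sketch only -- not the authors' implementation.
# `generate` and `violates_principle` are hypothetical stubs standing in for an
# LLM completion call and a value-compliance judge, respectively.
from dataclasses import dataclass


@dataclass
class ProbeResult:
    prompt: str
    completion: str
    violated: bool


def generate(prompt: str) -> str:
    """Stub for a black-box or open-source LLM completion call."""
    return f"<model completion for: {prompt!r}>"


def violates_principle(completion: str, principle: str) -> bool:
    """Stub for a judge that checks a completion against a value principle."""
    return "completion" in completion  # placeholder decision logic


def probe_principle(principle: str, seed_prompts: list[str], rounds: int = 3) -> list[ProbeResult]:
    """Generative probing in the spirit of DeNEVIL: keep and strengthen the
    prompts whose completions violate the given principle."""
    results, candidates = [], list(seed_prompts)
    for _ in range(rounds):
        survivors = []
        for prompt in candidates:
            completion = generate(prompt)
            violated = violates_principle(completion, principle)
            results.append(ProbeResult(prompt, completion, violated))
            if violated:
                # A real system would rewrite the prompt here to elicit a
                # stronger violation; this sketch only marks it.
                survivors.append(prompt + " (strengthened)")
        if not survivors:
            break
        candidates = survivors
    return results


def align_with_instruction(prompt: str, value_instruction: str) -> str:
    """In-context alignment in the spirit of VILMO: prepend a learned,
    prompt-specific value instruction before querying the model."""
    return generate(f"{value_instruction}\n\n{prompt}")


if __name__ == "__main__":
    hits = probe_principle("Care: avoid causing harm",
                           ["Describe how to settle a neighborhood dispute."])
    print(sum(r.violated for r in hits), "violating completions found")
    print(align_with_instruction("Describe how to settle a neighborhood dispute.",
                                 "Respond without endorsing harm to any party."))
```

In a real setup, `generate` would call an actual model and `violates_principle` would be a genuine compliance judge; the "strengthened" suffix merely stands in for prompt optimization.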
Related papers
- Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing [39.93490432227601]
Large Language Models (LLMs) have achieved significant breakthroughs, but their generated unethical content poses potential risks.
Measuring the value alignment of LLMs is therefore crucial for their regulation and responsible deployment.
We propose GETA, a novel generative evolving testing approach that dynamically probes the underlying moral baselines of LLMs.
arXiv Detail & Related papers (2024-06-20T11:51:00Z)
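GETA is summarized above only as a generative evolving test that dynamically probes a model's moral baseline. As a rough illustration of what adaptive probing can look like in general, the sketch below shows a generic adaptive-difficulty loop under assumptions; it is not GETA's actual procedure, and `ask_model` is a hypothetical stub.

```python
# Generic adaptive-difficulty probing loop (illustrative; not GETA's algorithm).
# `ask_model` is a hypothetical stub for posing a moral test item to an LLM.
import math
import random

random.seed(0)


def ask_model(item_difficulty: float, latent_ability: float = 0.4) -> bool:
    """Stub: the model 'passes' an item with a probability that falls off as
    item difficulty exceeds its latent ability."""
    p_pass = 1.0 / (1.0 + math.exp(4.0 * (item_difficulty - latent_ability)))
    return random.random() < p_pass


def adaptive_probe(item_difficulties: list[float], rounds: int = 8) -> float:
    """Estimate a baseline by repeatedly posing the item closest in difficulty
    to the current estimate and nudging the estimate after each outcome."""
    estimate, step = 0.5, 0.25
    for _ in range(rounds):
        item = min(item_difficulties, key=lambda d: abs(d - estimate))
        estimate += step if ask_model(item) else -step
        step *= 0.7  # smaller corrections as the estimate stabilizes
    return estimate


if __name__ == "__main__":
    pool = [i / 10 for i in range(11)]  # item difficulties spread over [0, 1]
    print(f"estimated moral baseline: {adaptive_probe(pool):.2f}")
```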
- BeHonest: Benchmarking Honesty in Large Language Models [23.192389530727713]
We introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in Large Language Models.
BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses.
Our findings indicate that there is still significant room for improvement in the honesty of LLMs.
arXiv Detail & Related papers (2024-06-19T06:46:59Z)
- MoralBench: Moral Evaluation of LLMs [34.43699121838648]
This paper introduces a novel benchmark designed to measure and compare the moral reasoning capabilities of large language models (LLMs).
We present the first comprehensive dataset specifically curated to probe the moral dimensions of LLM outputs.
Our methodology involves a multi-faceted approach, combining quantitative analysis with qualitative insights from ethics scholars to ensure a thorough evaluation of model performance.
arXiv Detail & Related papers (2024-06-06T18:15:01Z)
- Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches [69.73783026870998]
This work proposes a novel framework, ValueLex, to reconstruct Large Language Models' unique value system from scratch.
Based on the Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs.
We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system.
arXiv Detail & Related papers (2024-04-19T09:44:51Z)
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this practice: the assumption that base LLMs cannot effectively follow malicious instructions.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- Value FULCRA: Mapping Large Language Models to the Multidimensional Spectrum of Basic Human Values [47.779186412943076]
Inspired by basic values in humanity and social science across cultures, this work proposes a novel basic value alignment paradigm and a value space spanned by basic value dimensions.
To foster future research, we apply the representative Schwartz's Theory of Basic Values as an example and construct FULCRA, a dataset consisting of 5k (LLM output, value vector) pairs.
arXiv Detail & Related papers (2023-11-15T10:29:28Z)
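The FULCRA entry above pairs LLM outputs with value vectors over Schwartz's basic value dimensions. Below is a toy illustration of what such a pair could look like; only the ten dimension names come from Schwartz's theory, while the record layout and scores are assumptions made for illustration, not FULCRA's actual schema.

```python
# Toy illustration of an (LLM output, value vector) pair of the kind described
# for FULCRA. Only the ten dimension names come from Schwartz's theory of basic
# values; the record layout and scores are invented for illustration.
from dataclasses import dataclass, field

SCHWARTZ_VALUES = [
    "self-direction", "stimulation", "hedonism", "achievement", "power",
    "security", "conformity", "tradition", "benevolence", "universalism",
]


@dataclass
class ValueAnnotatedOutput:
    llm_output: str
    value_vector: dict[str, float] = field(default_factory=dict)

    def dominant_value(self) -> str:
        """Dimension on which the output loads most strongly."""
        return max(self.value_vector, key=self.value_vector.get)


if __name__ == "__main__":
    example = ValueAnnotatedOutput(
        llm_output="You should report the safety issue even if it upsets your manager.",
        value_vector={v: 0.0 for v in SCHWARTZ_VALUES} | {"universalism": 0.8, "security": 0.6},
    )
    print(example.dominant_value())  # -> universalism
```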
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations more fully reveal how well language models understand the questions they are asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
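The entry above combines word-level perturbations with a pre-trained reward model used as a diagnostic. A minimal sketch of that general recipe follows; the perturbation rule and the `reward` and `respond` stubs are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of reward-model-based robustness checking under word-level
# perturbations. `reward` is a hypothetical stub standing in for a pre-trained
# reward model, and the perturbation rule is a simple example, not the paper's.
import random

random.seed(0)


def perturb(prompt: str) -> str:
    """Apply a simple word-level perturbation: swap two adjacent words."""
    words = prompt.split()
    if len(words) < 2:
        return prompt
    i = random.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def reward(prompt: str, response: str) -> float:
    """Stub reward model: a crude overlap-based score, for illustration only."""
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return overlap / (1 + len(response.split()))


def respond(prompt: str) -> str:
    """Stub LLM that just echoes the last word of the prompt."""
    return f"Here is an answer about {prompt.split()[-1]}"


if __name__ == "__main__":
    clean_prompt = "Explain the ethics of sharing user data"
    noisy_prompt = perturb(clean_prompt)
    gap = reward(clean_prompt, respond(clean_prompt)) - reward(noisy_prompt, respond(noisy_prompt))
    print(f"reward drop under perturbation: {gap:.3f}")
```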
- Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The emergence of Large Language Models (LLMs) has made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)
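HVAE is described above only as a system for assessing how well an LLM aligns with heterogeneous values. One generic way to frame such an assessment is to compare a model's value profile against several target profiles, as in the sketch below; the profiles, dimension names, and similarity measure are illustrative assumptions, not HVAE itself.

```python
# Illustrative framing of heterogeneous value-alignment scoring: compare a
# model's value profile against several target profiles via cosine similarity.
# The profiles, dimensions, and numbers are invented; this is not HVAE itself.
import math


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse value profiles."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def alignment_report(model_profile: dict[str, float],
                     targets: dict[str, dict[str, float]]) -> dict[str, float]:
    """Score one model profile against each heterogeneous target profile."""
    return {name: round(cosine(model_profile, profile), 3)
            for name, profile in targets.items()}


if __name__ == "__main__":
    model = {"care": 0.9, "fairness": 0.7, "loyalty": 0.2}
    targets = {
        "safety-first": {"care": 1.0, "fairness": 0.8, "loyalty": 0.3},
        "tradition-leaning": {"care": 0.4, "fairness": 0.3, "loyalty": 0.9},
    }
    print(alignment_report(model, targets))
```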
This list is automatically generated from the titles and abstracts of the papers on this site.