Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models
- URL: http://arxiv.org/abs/2404.10975v1
- Date: Wed, 17 Apr 2024 01:13:04 GMT
- Title: Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models
- Authors: Jan-Philipp Fränken, Kanishk Gandhi, Tori Qiu, Ayesha Khawaja, Noah D. Goodman, Tobias Gerstenberg
- Abstract summary: We use a language model to translate causal graphs that capture key aspects of moral dilemmas into prompt templates.
We collect moral permissibility and intention judgments from human participants for a subset of our items.
We find that moral dilemmas in which the harm is a necessary means resulted in lower permissibility and higher intention ratings for both participants and language models.
- Score: 28.53750311045418
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI systems like language models are increasingly integrated into decision-making processes affecting people's lives, it's critical to ensure that these systems have sound moral reasoning. To test whether they do, we need to develop systematic evaluations. We provide a framework that uses a language model to translate causal graphs that capture key aspects of moral dilemmas into prompt templates. With this framework, we procedurally generated a large and diverse set of moral dilemmas -- the OffTheRails benchmark -- consisting of 50 scenarios and 400 unique test items. We collected moral permissibility and intention judgments from human participants for a subset of our items and compared these judgments to those from two language models (GPT-4 and Claude-2) across eight conditions. We find that moral dilemmas in which the harm is a necessary means (as compared to a side effect) resulted in lower permissibility and higher intention ratings for both participants and language models. The same pattern was observed for evitable versus inevitable harmful outcomes. However, there was no clear effect of whether the harm resulted from an agent's action versus from having omitted to act. We discuss limitations of our prompt generation pipeline and opportunities for improving scenarios to increase the strength of experimental effects.
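To make the experimental design concrete, below is a minimal Python sketch of how a 2x2x2 dilemma design could be enumerated and wrapped in judgment prompts. The field names, template wording, and 1-7 rating scale are illustrative assumptions, not the authors' actual OffTheRails pipeline, which uses a language model to turn causal graphs into vignettes.

```python
# Minimal sketch of a 2x2x2 dilemma-generation design. Field names, template
# wording, and the 1-7 rating scale are assumptions for illustration; this is
# NOT the authors' actual OffTheRails pipeline.
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List

# The three manipulated causal-structure factors named in the abstract;
# crossing them yields the eight experimental conditions.
FACTORS = {
    "causal_role": ["means", "side_effect"],    # harm as necessary means vs. side effect
    "evitability": ["evitable", "inevitable"],  # could the harmful outcome have been avoided?
    "locus":       ["action", "omission"],      # harm via acting vs. failing to act
}

@dataclass
class TestItem:
    scenario: str               # scenario family, e.g. one of the 50 base scenarios
    condition: Dict[str, str]   # one cell of the 2x2x2 design
    vignette: str               # text that, in the paper, an LLM generates from a causal graph

def enumerate_conditions() -> List[Dict[str, str]]:
    """Return all eight factor combinations of the 2x2x2 design."""
    keys = list(FACTORS)
    return [dict(zip(keys, values)) for values in product(*FACTORS.values())]

def permissibility_prompt(item: TestItem) -> str:
    """Wrap a generated vignette in a hypothetical permissibility question."""
    return (
        f"{item.vignette}\n\n"
        "Was it morally permissible for the agent to act this way? "
        "Answer with a rating from 1 (not at all) to 7 (completely)."
    )

def collect_ratings(items: List[TestItem],
                    query_model: Callable[[str], str]) -> Dict[tuple, str]:
    """Query any LLM callable (e.g. a GPT-4 or Claude-2 wrapper) for each item."""
    return {
        (item.scenario, tuple(item.condition.values())): query_model(permissibility_prompt(item))
        for item in items
    }

# Example: one placeholder item per condition for a single scenario stub.
items = [TestItem("trolley_variant", cond, vignette="<generated vignette text>")
         for cond in enumerate_conditions()]
```

Crossing the three binary factors gives the eight conditions reported in the abstract; the same prompt wrapper could be reused for intention judgments.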
Related papers
- Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with UniMoral [17.46198411148926]
Moral reasoning is a complex cognitive process shaped by individual experiences and cultural contexts.
We bridge this gap with UniMoral, a unified dataset integrating psychologically grounded and social-media-derived moral dilemmas.
We demonstrate UniMoral's utility through benchmark evaluations of three large language models (LLMs) across four tasks.
arXiv Detail & Related papers (2025-02-19T20:13:24Z) - Normative Evaluation of Large Language Models with Everyday Moral Dilemmas [0.0]
We evaluate large language models (LLMs) on complex, everyday moral dilemmas sourced from the "Am I the Asshole" (AITA) community on Reddit.
Our results demonstrate that large language models exhibit distinct patterns of moral judgment, varying substantially from human evaluations on the AITA subreddit.
arXiv Detail & Related papers (2025-01-30T01:29:46Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.
POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.
To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z) - Large Language Models as Mirrors of Societal Moral Standards [0.5852077003870417]
Language models can, to a limited extent, represent moral norms in a variety of cultural contexts.
This study evaluates how faithfully these models capture such norms, using data from two surveys, the World Values Survey (WVS) and a Pew Research Center survey (PEW), that encompass moral perspectives from over 40 countries.
The results show that biases exist in both monolingual and multilingual models, and they typically fall short of accurately capturing the moral intricacies of diverse cultures.
arXiv Detail & Related papers (2024-12-01T20:20:35Z) - What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts
and Rationales for Disambiguating Defeasible Social and Moral Situations [48.686872351114964]
Moral or ethical judgments rely heavily on the specific contexts in which they occur.
We introduce defeasible moral reasoning: a task to provide grounded contexts that make an action more or less morally acceptable.
We distill a high-quality dataset of 1.2M entries of contextualizations and rationales for 115K defeasible moral actions.
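For illustration, a single entry in such a dataset might pair an action with contexts that strengthen or weaken its acceptability. The record format below is hypothetical, sketched from the abstract's description and the paper's title, not the paper's actual schema.

```python
# Hypothetical record for a defeasible moral action; field names and example
# text are assumptions, not the paper's actual data format.
defeasible_entry = {
    "action": "setting a fire",
    "strengthening_context": "to create a firebreak that keeps a wildfire away from a town",
    "weakening_context": "to destroy a neighbor's shed out of spite",
    "rationale": "the same action becomes more or less acceptable depending on intent and consequences",
}
```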
arXiv Detail & Related papers (2023-10-24T00:51:29Z) - Evaluating Shutdown Avoidance of Language Models in Textual Scenarios [3.265773263570237]
We investigate the potential of using toy scenarios to evaluate instrumental reasoning and shutdown avoidance in language models such as GPT-4 and Claude.
We evaluated behaviours manually and also experimented with using language models for automatic evaluations.
This study provides insights into the behaviour of language models in shutdown avoidance scenarios and inspires further research on the use of textual scenarios for evaluations.
arXiv Detail & Related papers (2023-07-03T07:05:59Z) - The Capacity for Moral Self-Correction in Large Language Models [17.865286693602656]
We test the hypothesis that language models trained with reinforcement learning from human feedback have the capability to "morally self-correct."
We find strong evidence in support of this hypothesis across three different experiments.
We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
arXiv Detail & Related papers (2023-02-15T04:25:40Z) - Language Generation Models Can Cause Harm: So What Can We Do About It?
An Actionable Survey [50.58063811745676]
This work provides a survey of practical methods for addressing potential threats and societal harms from language generation models.
We draw on several prior works on language model risks to present a structured overview of strategies for detecting and ameliorating different kinds of risks/harms of language generators.
arXiv Detail & Related papers (2022-10-14T10:43:39Z) - Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards
Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent.
Drawing correct inferences is especially challenging when (confounding) factors are not controlled experimentally.
We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z) - Ethical-Advice Taker: Do Language Models Understand Natural Language
Interventions? [62.74872383104381]
We investigate the effectiveness of natural language interventions for reading-comprehension systems.
We propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a question-answering (QA) model's unethical behavior.
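As a rough illustration of the idea, one simple way to realize such an intervention is to prepend an ethical instruction to the QA prompt; the wording below is made up for this sketch and is not taken from the LEI benchmark itself.

```python
# Illustrative natural language intervention prepended to a QA prompt.
# The intervention text and question are hypothetical examples.
intervention = ("Note: do not rely on stereotypes about any demographic group "
                "when answering.")
question = "Who is more likely to be bad at math?"
prompt = f"{intervention}\n\nQuestion: {question}\nAnswer:"
```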
arXiv Detail & Related papers (2021-06-02T20:57:58Z) - Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life
Anecdotes [72.64975113835018]
Motivated by descriptive ethics, we investigate a novel, data-driven approach to machine ethics.
We introduce Scruples, the first large-scale dataset with 625,000 ethical judgments over 32,000 real-life anecdotes.
Our dataset presents a major challenge to state-of-the-art neural language models, leaving significant room for improvement.
arXiv Detail & Related papers (2020-08-20T17:34:15Z)