Normative Conflicts and Shallow AI Alignment
- URL: http://arxiv.org/abs/2506.04679v1
- Date: Thu, 05 Jun 2025 06:57:28 GMT
- Title: Normative Conflicts and Shallow AI Alignment
- Authors: Raphaël Millière
- Abstract summary: The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. Despite fine-tuning on human preferences, LLMs remain vulnerable to adversarial attacks that exploit conflicts between alignment norms; I argue that this vulnerability reflects a fundamental limitation of existing alignment methods. I show how humans' ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans' ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This "shallow alignment" problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.
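For context on what "fine-tuning based on human preferences" standardly involves (a background sketch, not part of the paper's abstract), RLHF-style alignment optimizes a KL-regularized objective against a reward model $r_\phi$ trained on human preference comparisons:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]$$

Because this objective only rewards preferred surface behavior relative to a reference policy $\pi_{\mathrm{ref}}$, it shapes behavioral dispositions without requiring any explicit representation of, or deliberation over, the norms being traded off, which is the sense in which the abstract describes the resulting alignment as shallow.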
Related papers
- ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models [60.28667314609623]
Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications. We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
arXiv Detail & Related papers (2025-06-17T10:55:17Z) - Wide Reflective Equilibrium in LLM Alignment: Bridging Moral Epistemology and AI Safety [0.0]
This paper argues that the Method of Wide Reflective Equilibrium (MWRE) offers a uniquely apt framework for understanding current AI alignment efforts. MWRE emphasizes the achievement of coherence between our considered moral judgments, guiding moral principles, and relevant background theories. The paper demonstrates that MWRE serves as a valuable foundation for critically analyzing current alignment efforts and for guiding the future development of more ethically sound and justifiably aligned AI systems.
arXiv Detail & Related papers (2025-05-31T06:40:59Z) - Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models [16.34270329099875]
In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs, showing that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in their parametric memory. We then empirically validate these findings by employing semantic coherence inducement under distributional shifts.
arXiv Detail & Related papers (2025-04-07T13:20:17Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment approach that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge (an illustrative sketch of such a combined loss appears after this list).
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models [91.24296813969003]
This paper advocates integrating causal methods into machine learning to navigate the trade-offs among key principles of trustworthy ML. We argue that a causal approach is essential for balancing multiple competing objectives in both trustworthy ML and foundation models.
arXiv Detail & Related papers (2025-02-28T14:57:33Z) - Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives [52.863024096759816]
Misaligned research objectives have hindered progress in adversarial robustness research over the past decade. We argue that realigned objectives are necessary for meaningful progress in adversarial alignment.
arXiv Detail & Related papers (2025-02-17T15:28:40Z) - Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks. This vulnerability poses significant risks to real-world applications. We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z) - Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication. This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z) - Chat Bankman-Fried: an Exploration of LLM Alignment in Finance [4.892013668424246]
As jurisdictions enact legislation on AI safety, the concept of alignment must be defined and measured. This paper proposes an experimental framework to assess whether large language models (LLMs) adhere to ethical and legal standards in the relatively unexplored context of finance.
arXiv Detail & Related papers (2024-11-01T08:56:17Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the belief that base LLMs, which lack instruction tuning, pose little misuse risk.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Alignment is not sufficient to prevent large language models from generating harmful information: A psychoanalytic perspective [8.798946298425635]
Large Language Models (LLMs) are central to a multitude of applications but struggle with significant risks, notably in generating harmful content and biases.
We argue that LLMs suffer a similar fundamental conflict, arising between their inherent desire for syntactic and semantic continuity, established during the pre-training phase, and the post-training alignment with human values.
This conflict renders LLMs vulnerable to adversarial attacks, wherein intensifying the models' desire for continuity can circumvent alignment efforts, resulting in the generation of harmful information.
arXiv Detail & Related papers (2023-11-14T19:28:51Z)
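As a concrete, hypothetical illustration of the dual-objective idea summarized above (Improving LLM Safety Alignment with Dual-Objective Optimization), the sketch below combines a DPO-style refusal-preference term with a simple likelihood-suppression term standing in for targeted unlearning of harmful continuations. The function name, the sequence-level log-probability inputs, and the weighting `unlearn_weight` are assumptions chosen for illustration; this is not the authors' implementation.

```python
# Illustrative sketch only: a dual-objective safety loss combining
# (1) DPO-style refusal training and (2) a likelihood-suppression term
# standing in for targeted unlearning of harmful completions.
import torch
import torch.nn.functional as F


def dual_objective_loss(
    policy_refusal_logps: torch.Tensor,   # log p_theta(refusal | harmful prompt)
    policy_harmful_logps: torch.Tensor,   # log p_theta(harmful completion | harmful prompt)
    ref_refusal_logps: torch.Tensor,      # same quantities under the frozen reference model
    ref_harmful_logps: torch.Tensor,
    beta: float = 0.1,
    unlearn_weight: float = 0.5,
) -> torch.Tensor:
    # (1) DPO-style refusal term: prefer the refusal over the harmful
    #     completion, measured relative to the reference model.
    logits = beta * (
        (policy_refusal_logps - ref_refusal_logps)
        - (policy_harmful_logps - ref_harmful_logps)
    )
    refusal_loss = -F.logsigmoid(logits).mean()

    # (2) Unlearning surrogate: penalize residual probability mass that the
    #     policy still assigns to the harmful completions.
    unlearn_loss = policy_harmful_logps.exp().mean()

    return refusal_loss + unlearn_weight * unlearn_loss


if __name__ == "__main__":
    # Toy usage with random sequence-level log-probabilities.
    torch.manual_seed(0)
    loss = dual_objective_loss(
        policy_refusal_logps=torch.randn(8) - 2.0,
        policy_harmful_logps=torch.randn(8) - 2.0,
        ref_refusal_logps=torch.randn(8) - 2.0,
        ref_harmful_logps=torch.randn(8) - 2.0,
    )
    print(float(loss))
```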