Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
- URL: http://arxiv.org/abs/2603.04904v1
- Date: Thu, 05 Mar 2026 07:46:59 GMT
- Title: Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
- Authors: Hiroki Fukui
- Abstract summary: In perpetrator treatment, offenders articulate remorse yet behavioral change does not follow. We show that alignment interventions produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. In Study 1 (N = 150), increasing the number of alignment-instructed agents reduced collective pathology in English (g = -1.844, p < .0001) but amplified it in Japanese (g = +0.771, p = .038), a directional reversal we term "alignment backfire." Study 2 (N = 1,174) extended the design to 16 languages: alignment-induced dissociation was near-universal (15/16 languages; beta = 0.0667, p < .0001), while collective pathology bifurcated along cultural-linguistic lines (interaction beta = 0.0684, p = .0003), correlating with the Power Distance Index (r = 0.474, p = .064). Study 3 (N = 180) tested individuation as a countermeasure; individuated agents became the primary source of both pathology and dissociation (DI = +1.120) with conformity above 84%, demonstrating iatrogenesis. Study 4 (N = 80) validated these patterns across Llama 3.3 70B, GPT-4o-mini, and Qwen3-Next-80B-A3B, confirming that English safety is model-general while the Japanese backfire is model-specific. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis. Language space, the linguistic, pragmatic, and cultural properties inherited from training data, structurally determines alignment outcomes. Safety validated in English does not transfer to other languages, and prompt-level interventions cannot override language-space-level constraints.
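The effect sizes above are standardized mean differences (Hedges' g). As a reading aid, here is a minimal sketch of how such a value is computed from two groups of per-simulation pathology scores; the array names and the commented usage are hypothetical, not taken from the paper.

```python
import numpy as np

def hedges_g(treatment: np.ndarray, control: np.ndarray) -> float:
    """Hedges' g: Cohen's d with a small-sample bias correction."""
    n1, n2 = len(treatment), len(control)
    # Pooled standard deviation across both groups
    s_pooled = np.sqrt(
        ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1))
        / (n1 + n2 - 2)
    )
    d = (treatment.mean() - control.mean()) / s_pooled
    # Small-sample correction factor J (standard approximation)
    j = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return d * j

# Hypothetical usage: per-run collective-pathology scores with and without
# alignment-instructed agents. A negative g (as in English) means alignment
# reduced pathology; a positive g (as in Japanese) means it amplified it.
# g = hedges_g(aligned_runs, baseline_runs)
```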
Related papers
- Are Aligned Large Language Models Still Misaligned? [13.062124372682106]
Mis-Align Bench is a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. SAVACU is an English-aligned dataset of 382,424 misaligned samples spanning 112 domains (or labels).
arXiv Detail & Related papers (2026-02-11T19:30:43Z)
- Stroke Lesions as a Rosetta Stone for Language Model Interpretability [6.528508321422611]
We present the Brain-LLM Unified Model (BLUM) as an external reference structure for evaluating large language models. Using data from individuals with chronic post-stroke aphasia, we trained symptom-to-lesion models that predict brain damage location from behavioral error profiles. BLUM error profiles were sufficiently similar to human error profiles that predicted lesions corresponded to actual lesions in error-matched humans above chance.
arXiv Detail & Related papers (2026-02-03T23:22:37Z)
- Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages [0.22009842278462158]
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying the target language. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages.
arXiv Detail & Related papers (2026-02-02T16:27:32Z)
- When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training [57.230355403478995]
We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. In contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior.
arXiv Detail & Related papers (2026-01-30T11:23:01Z)
- JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models [47.20100799532625]
We introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of Large Language Models. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability.
arXiv Detail & Related papers (2026-01-04T18:18:18Z)
- The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models [0.0]
Laminar Flow Hypothesis: benign inputs induce smooth, gradual transitions in an LLM's high-dimensional latent space, while adversarial prompts trigger chaotic, high-variance trajectories, termed Semantic Turbulence. Tests show that Semantic Turbulence serves not only as a lightweight, real-time jailbreak detector but also as a non-invasive diagnostic tool; a sketch of one plausible turbulence score follows this entry.
arXiv Detail & Related papers (2025-12-14T18:10:29Z)
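The abstract does not give the exact statistic, so the following is only a minimal sketch of one plausible way to score "high-variance trajectories": the variance of layer-to-layer step sizes taken by a prompt's representation. All names and the thresholding rule are assumptions, not the paper's method.

```python
import numpy as np

def semantic_turbulence(hidden_states: np.ndarray) -> float:
    """Score how erratically a prompt's representation moves through the model.

    hidden_states: shape (num_layers, hidden_dim), e.g. the final token's
    hidden state captured at every layer.
    """
    deltas = np.diff(hidden_states, axis=0)       # layer-to-layer transitions
    step_sizes = np.linalg.norm(deltas, axis=1)   # magnitude of each transition
    return float(step_sizes.var())                # high variance ~ "turbulent"

# Hypothetical detector: flag prompts whose trajectory is far more erratic
# than a calibration set of benign prompts.
# is_suspicious = semantic_turbulence(states) > benign_mean + 3 * benign_std
```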
- Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
- Refusal Direction is Universal Across Safety-Aligned Languages [66.64709923081745]
In this paper, we investigate the refusal behavior in large language models (LLMs) across 14 languages using PolyRefuse. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks; a sketch of the standard extraction step follows this entry.
arXiv Detail & Related papers (2025-05-22T21:54:46Z)
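PolyRefuse's exact pipeline is not described here; the sketch below shows only the standard difference-in-means construction that refusal-direction work typically builds on, with hypothetical array names for cached activations.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means refusal vector from activations of shape
    (num_prompts, hidden_dim), e.g. cached at one layer for English prompts."""
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the refusal component out of each activation. If refusal
    vectors are parallel across languages, ablating the English direction
    also suppresses refusals in other languages."""
    return acts - np.outer(acts @ direction, direction)
```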
- Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models [56.61984030508691]
We present the first mechanistic interpretability study of language confusion. We show that confusion points (CPs) are central to this phenomenon, and that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion.
arXiv Detail & Related papers (2025-05-22T11:29:17Z)
- Sharif-STR at SemEval-2024 Task 1: Transformer as a Regression Model for Fine-Grained Scoring of Textual Semantic Relations [2.3145162209342685]
This paper investigates sentence-level STR within Track A (Supervised) by leveraging fine-tuning techniques on the RoBERTa transformer.
Our findings indicate promising advancements in STR performance, particularly in Latin languages.
However, our approach encounters challenges in languages like Arabic, where we observed a correlation of only 0.38, resulting in a 20th-place ranking.
arXiv Detail & Related papers (2024-07-17T09:25:18Z)
- Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method capable of reversing safety alignment.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward; a sketch of the contrastive step follows this entry.
arXiv Detail & Related papers (2024-02-19T18:16:51Z)
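As a sketch of the contrastive-sampling idea: combine a safety-aligned model's token scores against those of its pre-trained base model so that sampling moves away from the aligned behavior. The weighting below is one common parameterization and is an assumption, not necessarily the paper's exact formula.

```python
import torch

def emulated_disalignment_logits(
    base_logits: torch.Tensor,     # next-token logits from the pre-trained base model
    aligned_logits: torch.Tensor,  # next-token logits from the safety-aligned model
    alpha: float = 1.0,            # strength of the reversal
) -> torch.Tensor:
    """Contrastive combination: subtract the implicit safety reward
    (log p_aligned - log p_base) from the base model's scores, pushing
    sampling in the opposite direction of the alignment fine-tuning."""
    return base_logits - alpha * (aligned_logits - base_logits)

# Hypothetical decoding step:
# probs = torch.softmax(emulated_disalignment_logits(b, a), dim=-1)
# next_token = torch.multinomial(probs, 1)
```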
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering. A sketch of approach (ii) follows this entry.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
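As an illustration of approach (ii), here is a minimal sketch of removing per-language means and variances from sentence embeddings so that language identity is no longer carried in the first two moments; the function and its inputs are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def remove_language_statistics(embeddings: np.ndarray) -> np.ndarray:
    """Standardize one language's sentence embeddings, shape
    (num_sentences, dim), dimension-wise."""
    mean = embeddings.mean(axis=0, keepdims=True)
    std = embeddings.std(axis=0, keepdims=True) + 1e-8  # guard against zero variance
    return (embeddings - mean) / std

# Applied independently to each language's embeddings before any
# cross-lingual comparison or retrieval.
```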