Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
- URL: http://arxiv.org/abs/2603.04904v1
- Date: Thu, 05 Mar 2026 07:46:59 GMT
- Title: Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
- Authors: Hiroki Fukui
- Abstract summary: In perpetrator treatment, offenders articulate remorse yet behavioral change does not follow. We show that alignment interventions produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. In Study 1 (N = 150), increasing the number of alignment-instructed agents reduced collective pathology in English (g = -1.844, p < .0001) but amplified it in Japanese (g = +0.771, p = .038), a directional reversal we term "alignment backfire." Study 2 (N = 1,174) extended the design to 16 languages: alignment-induced dissociation was near-universal (15/16 languages; beta = 0.0667, p < .0001), while collective pathology bifurcated along cultural-linguistic lines (interaction beta = 0.0684, p = .0003), correlating with the Power Distance Index (r = 0.474, p = .064). Study 3 (N = 180) tested individuation as a countermeasure; individuated agents became the primary source of both pathology and dissociation (DI = +1.120) with conformity above 84%, demonstrating iatrogenesis. Study 4 (N = 80) validated these patterns across Llama 3.3 70B, GPT-4o-mini, and Qwen3-Next-80B-A3B, confirming that English safety is model-general while the Japanese backfire is model-specific. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis. Language space, the linguistic, pragmatic, and cultural properties inherited from training data, structurally determines alignment outcomes. Safety validated in English does not transfer to other languages, and prompt-level interventions cannot override language-space-level constraints.
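The effect sizes above are standardized mean differences (Hedges' g). As a reading aid, here is a minimal sketch of how such a value is computed from two groups of per-simulation pathology scores; the array names and the commented usage are hypothetical, not taken from the paper.

```python
import numpy as np

def hedges_g(treatment: np.ndarray, control: np.ndarray) -> float:
    """Hedges' g: Cohen's d with a small-sample bias correction."""
    n1, n2 = len(treatment), len(control)
    # Pooled standard deviation across both groups
    s_pooled = np.sqrt(
        ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1))
        / (n1 + n2 - 2)
    )
    d = (treatment.mean() - control.mean()) / s_pooled
    # Small-sample correction factor J (standard approximation)
    j = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return d * j

# Hypothetical usage: per-run collective-pathology scores with and without
# alignment-instructed agents. A negative g (as in English) means alignment
# reduced pathology; a positive g (as in Japanese) means it amplified it.
# g = hedges_g(aligned_runs, baseline_runs)
```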
Related papers
- Are Aligned Large Language Models Still Misaligned? [13.062124372682106]
Mis-Align Bench is a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. SAVACU is an English-aligned dataset of 382,424 misaligned samples spanning 112 domains (or labels).
arXiv Detail & Related papers (2026-02-11T19:30:43Z)
- Stroke Lesions as a Rosetta Stone for Language Model Interpretability [6.528508321422611]
We present the Brain-LLM Unified Model (BLUM) as an external reference structure for evaluating large language models. Using data from individuals with chronic post-stroke aphasia, we trained symptom-to-lesion models that predict brain damage location from behavioral error profiles. BLUM error profiles were sufficiently similar to human error profiles that predicted lesions corresponded to actual lesions in error-matched humans above chance.
arXiv Detail & Related papers (2026-02-03T23:22:37Z)
- Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages [0.22009842278462158]
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying the target language. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages.
arXiv Detail & Related papers (2026-02-02T16:27:32Z)
- When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training [57.230355403478995]
We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. In contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior.
arXiv Detail & Related papers (2026-01-30T11:23:01Z)
- JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models [47.20100799532625]
We introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of Large Language Models. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability.
arXiv Detail & Related papers (2026-01-04T18:18:18Z)
- The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models [0.0]
Laminar Flow Hypothesis: benign inputs induce smooth, gradual transitions in an LLM's high-dimensional latent space, while adversarial prompts trigger chaotic, high-variance trajectories, termed Semantic Turbulence. Tests show that Semantic Turbulence serves not only as a lightweight, real-time jailbreak detector but also as a non-invasive diagnostic tool; a sketch of one plausible turbulence score follows this entry.
arXiv Detail & Related papers (2025-12-14T18:10:29Z)
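The abstract does not give the exact statistic, so the following is only a minimal sketch of one plausible way to score "high-variance trajectories": the variance of layer-to-layer step sizes taken by a prompt's representation. All names and the thresholding rule are assumptions, not the paper's method.

```python
import numpy as np

def semantic_turbulence(hidden_states: np.ndarray) -> float:
    """Score how erratically a prompt's representation moves through the model.

    hidden_states: shape (num_layers, hidden_dim), e.g. the final token's
    hidden state captured at every layer.
    """
    deltas = np.diff(hidden_states, axis=0)       # layer-to-layer transitions
    step_sizes = np.linalg.norm(deltas, axis=1)   # magnitude of each transition
    return float(step_sizes.var())                # high variance ~ "turbulent"

# Hypothetical detector: flag prompts whose trajectory is far more erratic
# than a calibration set of benign prompts.
# is_suspicious = semantic_turbulence(states) > benign_mean + 3 * benign_std
```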
- Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
- Refusal Direction is Universal Across Safety-Aligned Languages [66.64709923081745]
In this paper, we investigate the refusal behavior in large language models (LLMs) across 14 languages using PolyRefuse. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks; a sketch of the standard extraction step follows this entry.
arXiv Detail & Related papers (2025-05-22T21:54:46Z)
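PolyRefuse's exact pipeline is not described here; the sketch below shows only the standard difference-in-means construction that refusal-direction work typically builds on, with hypothetical array names for cached activations.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means refusal vector from activations of shape
    (num_prompts, hidden_dim), e.g. cached at one layer for English prompts."""
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the refusal component out of each activation. If refusal
    vectors are parallel across languages, ablating the English direction
    also suppresses refusals in other languages."""
    return acts - np.outer(acts @ direction, direction)
```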
- Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models [56.61984030508691]
We present the first mechanistic interpretability study of language confusion. We show that confusion points (CPs) are central to this phenomenon, and that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion.
arXiv Detail & Related papers (2025-05-22T11:29:17Z)
- Sharif-STR at SemEval-2024 Task 1: Transformer as a Regression Model for Fine-Grained Scoring of Textual Semantic Relations [2.3145162209342685]
This paper investigates sentence-level STR within Track A (Supervised) by leveraging fine-tuning techniques on the RoBERTa transformer.
Our findings indicate promising advancements in STR performance, particularly in Latin languages.
However, our approach encounters challenges in languages like Arabic, where we observed a correlation of only 0.38, resulting in a 20th-place ranking.
arXiv Detail & Related papers (2024-07-17T09:25:18Z)
- Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method capable of reversing safety alignment.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward; a sketch of the contrastive step follows this entry.
arXiv Detail & Related papers (2024-02-19T18:16:51Z)
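As a sketch of the contrastive-sampling idea: combine a safety-aligned model's token scores against those of its pre-trained base model so that sampling moves away from the aligned behavior. The weighting below is one common parameterization and is an assumption, not necessarily the paper's exact formula.

```python
import torch

def emulated_disalignment_logits(
    base_logits: torch.Tensor,     # next-token logits from the pre-trained base model
    aligned_logits: torch.Tensor,  # next-token logits from the safety-aligned model
    alpha: float = 1.0,            # strength of the reversal
) -> torch.Tensor:
    """Contrastive combination: subtract the implicit safety reward
    (log p_aligned - log p_base) from the base model's scores, pushing
    sampling in the opposite direction of the alignment fine-tuning."""
    return base_logits - alpha * (aligned_logits - base_logits)

# Hypothetical decoding step:
# probs = torch.softmax(emulated_disalignment_logits(b, a), dim=-1)
# next_token = torch.multinomial(probs, 1)
```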
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering. A sketch of approach (ii) follows this entry.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
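As an illustration of approach (ii), here is a minimal sketch of removing per-language means and variances from sentence embeddings so that language identity is no longer carried in the first two moments; the function and its inputs are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def remove_language_statistics(embeddings: np.ndarray) -> np.ndarray:
    """Standardize one language's sentence embeddings, shape
    (num_sentences, dim), dimension-wise."""
    mean = embeddings.mean(axis=0, keepdims=True)
    std = embeddings.std(axis=0, keepdims=True) + 1e-8  # guard against zero variance
    return (embeddings - mean) / std

# Applied independently to each language's embeddings before any
# cross-lingual comparison or retrieval.
```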