The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context
- URL: http://arxiv.org/abs/2504.02708v1
- Date: Thu, 03 Apr 2025 15:46:46 GMT
- Title: The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context
- Authors: Nikhil Verma, Manasa Bharadwaj
- Abstract summary: Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. Despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings.
- Score: 0.9130277390156759
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.
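The idea of using "alignment-induced separation" as a measurement can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not name the metric, so `separation_score` (centroid distance divided by mean within-class spread) is a hypothetical stand-in, and `synthetic_embeddings` generates Gaussian toy data rather than real hidden states from any model.

```python
import numpy as np


def separation_score(safe: np.ndarray, harmful: np.ndarray) -> float:
    """Distance between class centroids divided by mean within-class spread.

    Hypothetical metric chosen for illustration only; the paper may use a
    different measure of separation.
    """
    mu_safe = safe.mean(axis=0)
    mu_harm = harmful.mean(axis=0)
    between = np.linalg.norm(mu_safe - mu_harm)
    spread_safe = np.linalg.norm(safe - mu_safe, axis=1).mean()
    spread_harm = np.linalg.norm(harmful - mu_harm, axis=1).mean()
    within = 0.5 * (spread_safe + spread_harm)
    return float(between / (within + 1e-12))


def synthetic_embeddings(rng, n: int, dim: int, offset: float) -> np.ndarray:
    """Toy stand-in for hidden states: a Gaussian cloud shifted along axis 0."""
    x = rng.normal(size=(n, dim))
    x[:, 0] += offset
    return x


rng = np.random.default_rng(0)
n, dim = 200, 16

# Toy assumption: before alignment the two classes nearly overlap,
# after alignment the harmful inputs sit far from the safe ones.
base_score = separation_score(
    synthetic_embeddings(rng, n, dim, 0.0),
    synthetic_embeddings(rng, n, dim, 0.5),
)
aligned_score = separation_score(
    synthetic_embeddings(rng, n, dim, 0.0),
    synthetic_embeddings(rng, n, dim, 6.0),
)

print(f"base model separation:    {base_score:.3f}")
print(f"aligned model separation: {aligned_score:.3f}")
```

In a real study the two arrays would hold per-sentence embeddings of harmful and harmless inputs in one language, extracted from the same layer of the base and the aligned model. Comparing the score across languages would then show where alignment separates the classes less.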
Related papers
- MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning [56.79292318645454]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking.
This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data are often limited.
We propose an approach to build a multilingual guardrail with reasoning.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
- Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs [12.334510055293535]
Large language models (LLMs) pre-trained predominantly on English text exhibit surprising multilingual capabilities.
We introduce cross-lingual alignment metrics to quantify the alignment at an instance level for discriminative tasks.
We find that while cross-lingual alignment metrics strongly correlate with task accuracy at the language level, the sample-level alignment often fails to distinguish correct from incorrect predictions.
arXiv Detail & Related papers (2025-04-13T00:01:22Z)
- High-Dimensional Interlingual Representations of Large Language Models [65.77317753001954]
Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs.
We explore 31 diverse languages varying on their resource-levels, typologies, and geographical regions.
We find that multilingual LLMs exhibit inconsistent cross-lingual alignments.
arXiv Detail & Related papers (2025-03-14T10:39:27Z)
- Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment [4.368725325557961]
Soteria locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language.
XThreatBench is a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines.
Experiments with leading open-source LLMs show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages.
arXiv Detail & Related papers (2025-02-16T19:44:01Z)
- ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework [78.07201802874529]
ShifCon is a Shift-based Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one.
It shifts the representations of non-dominant languages into the dominant language subspace, allowing them to access relatively rich information encoded in the model parameters.
Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages.
arXiv Detail & Related papers (2024-10-25T10:28:59Z)
- Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention [71.12193680015622]
Large Language Models (LLMs) have shown remarkable capabilities in natural language processing.
LLMs exhibit significant performance gaps among different languages.
We propose Inference-Time Cross-Lingual Intervention (INCLINE) to overcome these limitations without incurring significant costs.
arXiv Detail & Related papers (2024-10-16T11:23:03Z)
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhance multilingual capabilities of large language models (LLMs).
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with much fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
- LLM for Everyone: Representing the Underrepresented in Large Language Models [21.07409393578553]
This thesis aims to bridge the gap in NLP research and development by focusing on underrepresented languages.
A comprehensive evaluation of large language models (LLMs) is conducted to assess their capabilities in these languages.
The proposed solutions cover cross-lingual continual instruction tuning, retrieval-based cross-lingual in-context learning, and in-context query alignment.
arXiv Detail & Related papers (2024-09-20T20:53:22Z)
- Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture [6.17896401271963]
We introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various large language models.
We investigate language patterns such as language availability, morphology, and language family that could impact the effectiveness of Multilingual Blending.
arXiv Detail & Related papers (2024-07-10T03:26:15Z)
- Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous multilingual alignment improvement of LLMs.
We find that LLMs instruction-tuned on the question translation data (i.e. without annotated answers) are able to encourage the alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z)
- Exploring Multilingual Concepts of Human Value in Large Language Models: Is Value Alignment Consistent, Transferable and Controllable across Languages? [34.38469832305664]
This paper focuses on human values-related concepts (i.e., value concepts) due to their significance for AI safety.
We first empirically confirm the presence of value concepts within LLMs in a multilingual format.
Further analysis on the cross-lingual characteristics of these concepts reveals 3 traits arising from language resource disparities.
arXiv Detail & Related papers (2024-02-28T07:18:39Z)
- The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts [46.089025223336854]
This paper examines the variations in safety challenges faced by large language models across different languages.
We compare how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages.
arXiv Detail & Related papers (2024-01-23T23:12:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.