Related papers: Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study

Related papers

Multilingual Safety Alignment Via Sparse Weight Editing [11.684928396991742]
We propose a training-free alignment framework based on Sparse Weight Editing. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs. Our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities.
arXiv Detail & Related papers (2026-02-26T02:46:13Z)
Layer-wise Swapping for Generalizable Multilingual Safety [8.658596218544773]
Existing safety datasets are predominantly English centric, limiting progress in multilingual safety alignment. We propose a safety aware layer swapping method that transfers safety alignment from an English safety expert to low resource language experts without additional training.
arXiv Detail & Related papers (2026-01-30T06:22:02Z)
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages [57.059267233093465]
Large Language Models (LLMs) have transformed natural language processing, but their safety mechanisms remain under-explored in low-resource, multilingual settings. We introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails.
arXiv Detail & Related papers (2025-09-18T08:14:34Z)
Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation [13.971909819796762]
Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. Embedding space poisoning is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. We propose ETTA, a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations.
arXiv Detail & Related papers (2025-07-08T03:01:00Z)
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning [61.594212398272184]
Low-Rank Extrapolation (LoX) improves robustness against benign and malicious fine-tuning attacks. LoX leads to 11% to 54% absolute reductions in attack success rates.
arXiv Detail & Related papers (2025-06-18T16:30:02Z)
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety [57.14003339251827]
We introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning. As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning. We demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms.
arXiv Detail & Related papers (2025-05-26T14:50:01Z)
MPO: Multilingual Safety Alignment via Reward Gap Optimization [88.76638442683391]
Large language models (LLMs) have become increasingly central to AI applications worldwide. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. We introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages.
arXiv Detail & Related papers (2025-05-22T16:24:51Z)
Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents [13.63944785085617]
Generalizable alignment is a core challenge for deploying Large Language Models (LLMs) safely in real-world NLP applications. Inspired by a paradigm shift to first curate data before tuning, we introduce a new framework for safe language alignment. We formalize the framework within a Constrained Markov Decision Process (CMDP) and validate it via a text-based navigation environment.
arXiv Detail & Related papers (2025-04-04T05:26:28Z)
The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context [0.9130277390156759]
Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. Despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings.
arXiv Detail & Related papers (2025-04-03T15:46:46Z)
Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models [7.020171518136542]
We introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in large language models (LLMs). We employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance assessing effectiveness. The proposed JailBench is extensively evaluated over 13 mainstream LLMs and achieves the highest attack success rate against ChatGPT.
arXiv Detail & Related papers (2025-02-26T08:36:42Z)
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment [4.368725325557961]
Soteria locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. XThreatBench is a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages.
arXiv Detail & Related papers (2025-02-16T19:44:01Z)
Direct Preference Optimization Using Sparse Feature-Level Constraints [47.15096507230884]
Feature-level constrained Preference Optimization is a novel method designed to simplify the alignment process while ensuring stability. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence.
arXiv Detail & Related papers (2024-11-12T07:54:13Z)
Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis [20.023077870947024]
This study focuses on Contrastive Preference Optimization (CPO) and conducts experiments to evaluate the impact of preference-based alignment on translation quality. Our findings indicate that while CPO consistently outperforms Supervised Fine-Tuning (SFT) on high-quality data with regard to the alignment metric, it may lead to instability across downstream evaluation metrics.
arXiv Detail & Related papers (2024-09-30T08:01:44Z)
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during the fine-tuning remains a critical concern. We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
arXiv Detail & Related papers (2024-08-27T17:31:21Z)
ABC Align: Large Language Model Alignment for Safety & Accuracy [0.0]
We present ABC Align, a novel alignment methodology for Large Language Models (LLMs). We combine a set of data and methods that build on recent breakthroughs in synthetic data generation, preference optimisation, and post-training model quantisation. Our unified approach mitigates bias and improves accuracy, while preserving reasoning capability, as measured against standard benchmarks.
arXiv Detail & Related papers (2024-08-01T06:06:25Z)
Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
SafePatching is a novel framework for comprehensive PSA. SafePatching achieves a more comprehensive PSA than baseline methods. SafePatching demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
Enhancing LLM Safety via Constrained Direct Preference Optimization [8.22888921018027]
We introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning AI systems. By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning. Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint.
arXiv Detail & Related papers (2024-03-04T20:39:24Z)
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications [69.13807233595455]
Large language models (LLMs) show inherent brittleness in their safety mechanisms. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted.
arXiv Detail & Related papers (2024-02-07T18:34:38Z)
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs). It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z)
All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build the first multilingual safety benchmark for large language models (LLMs). XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z)
Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level. Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
Demystify Optimization Challenges in Multilingual Transformers [21.245418118851884]
We study optimization challenges from loss landscape and parameter plasticity perspectives. We find that imbalanced training data causes task interference between high- and low-resource languages. We propose Curvature Aware Task Scaling (CATS), which improves both optimization and generalization, especially for low-resource languages.
arXiv Detail & Related papers (2021-04-15T17:51:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.