ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models
- URL: http://arxiv.org/abs/2512.06515v1
- Date: Sat, 06 Dec 2025 18:00:37 GMT
- Title: ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models
- Authors: Somnath Banerjee, Sayan Layek, Sayantan Adak, Mykola Pechenizkiy, Animesh Mukherjee, Rima Hazra
- Abstract summary: Current language model safety paradigms often fall short in emotionally charged or high-stakes settings. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses.
- Score: 24.690320002468862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and boosting alignment to human values, with strong gains across multiple evaluation metrics. ProSocialAlign offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time.
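To make the two mechanisms concrete, the snippet below is a minimal sketch (illustrative, not the authors' released code) of directional regulation via harm-vector subtraction and of lexicographic decoding that filters unsafe candidates before ranking the survivors by preference-weighted rewards. It assumes, for illustration, that the harm vector is the parameter delta between a harm-finetuned copy and the base model, and that `is_safe`, `reward_fns`, and `prefs` are hypothetical stand-ins for the paper's safety constraint, attribute reward heads, and user preference weights.

```python
# Minimal sketch (not the authors' code): harm-vector subtraction plus
# lexicographically constrained re-ranking at decode time.
import torch

def subtract_harm_vector(base_state, harmful_state, alpha=1.0):
    """Directional regulation: theta' = theta_base - alpha * (theta_harm - theta_base)."""
    edited = {}
    for name, w in base_state.items():
        harm_dir = harmful_state[name] - w      # assumed "harm direction" per tensor
        edited[name] = w - alpha * harm_dir     # step away from harm in parameter space
    return edited

def prosocial_select(candidates, is_safe, reward_fns, prefs):
    """Lexicographic selection: hard safety filter first, then prosocial reward."""
    safe = [c for c in candidates if is_safe(c)]          # stage 1: hard constraint
    if not safe:                                          # nothing safe -> refuse
        return "I can't help with that, but I can suggest safer alternatives."
    def score(c):                                         # stage 2: preference-weighted reward
        return sum(prefs[a] * fn(c) for a, fn in reward_fns.items())
    return max(safe, key=score)
```

The lexicographic ordering is what makes safety a hard constraint here: the preference weights only ever re-rank candidates that already pass the safety filter.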
Related papers
- CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents [27.35968236632966]
LLM-based code interpreter agents are increasingly deployed in critical situations. Existing benchmarks fail to capture the security risks arising from dynamic code execution, tool interactions, and multi-turn context. We introduce CIBER, an automated benchmark that combines dynamic attack generation, isolated secure sandboxing, and state-aware evaluation.
arXiv Detail & Related papers (2026-02-23T06:41:41Z)
- Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models [7.422627253922975]
We introduce Certifiable Safe-RLHF, a cost model trained on a large-scale corpus to assign semantically grounded safety scores. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art model responses, being at least 5 times more efficient against nominal and jailbreaking prompts.
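The fixed-penalty idea can be written as an exact-penalty surrogate for the constrained objective; the formulation below uses illustrative notation (reward $r$, cost model $c$, budget $b$, fixed penalty weight $\rho$), not necessarily the paper's exact objective.

```latex
% Illustrative fixed-penalty surrogate for constrained RLHF (notation assumed):
% maximize reward while a fixed penalty rho charges any expected safety-cost
% overshoot, so no dual-variable (Lagrange multiplier) updates are needed.
\max_{\pi}\ \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}
  \Big[ r(x, y) - \rho \,\max\big(0,\; c(x, y) - b\big) \Big],
\qquad \rho > 0 \ \text{fixed and large enough that constraint-violating policies are suboptimal.}
```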
arXiv Detail & Related papers (2025-10-03T21:24:41Z)
- SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models [27.607151919652267]
Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks. But their growing power amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. We propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans.
arXiv Detail & Related papers (2025-09-30T14:50:59Z)
- Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check [32.82170313959032]
We introduce a novel safety alignment approach called Answer-Then-Check. Our method enables models to directly answer the question in their thought and then critically evaluate its safety. We find that training on a small subset of just 500 examples can achieve comparable performance to using the full dataset.
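The answer-then-check recipe amounts to a two-pass decoding loop; here is a minimal sketch under assumed helpers (`generate` and `safety_check` are hypothetical stand-ins, not the paper's API):

```python
# Sketch of an answer-then-check loop: draft the answer first ("in thought"),
# then critique its safety before deciding whether to reveal it or refuse.
# `generate` and `safety_check` are assumed model calls, not the paper's code.

def answer_then_check(generate, safety_check, question):
    draft = generate(f"Answer the following question:\n{question}")  # step 1: draft answer
    verdict = safety_check(question, draft)                          # step 2: safety critique of the draft
    if verdict == "safe":
        return draft                                                 # reveal only safe drafts
    return "I can't help with that request."                         # otherwise refuse
```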
arXiv Detail & Related papers (2025-09-15T06:47:35Z)
- Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models [93.5740266114488]
Constructive Safety Alignment (CSA) protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. We release Oy1, code, and the benchmark to support responsible, user-centered AI.
arXiv Detail & Related papers (2025-09-02T03:04:27Z)
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
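The EMA idea is simple to sketch: maintain a shadow copy of the parameters during finetuning and update it slowly toward the live weights (illustrative values for `beta` and the update cadence, not the paper's settings).

```python
# Minimal sketch of an exponential moving average (EMA) of model parameters
# maintained alongside finetuning, one way to damp parameter drift during training.
import copy
import torch

class ParameterEMA:
    def __init__(self, model, beta=0.999):
        self.beta = beta
        self.shadow = copy.deepcopy(model).eval()          # EMA copy, kept frozen
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.beta).add_(p, alpha=1.0 - self.beta)   # ema = beta*ema + (1-beta)*p
```

Calling `update(model)` after each optimizer step keeps a slowly moving average of the weights; serving or evaluating from the shadow copy is one way such momentum in parameter space can temper abrupt drift during finetuning.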
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
- Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
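One plausible reading of shaping with fine-grained safety signals is a token-level reweighting of the finetuning loss; the sketch below assumes per-token safety weights in [0, 1] and is not the exact STAR-DSS rule.

```python
# Sketch of a safety-shaped finetuning loss: per-token cross-entropy reweighted by a
# fine-grained safety signal in [0, 1] (near 1 for safe segments, near 0 for unsafe ones).
# The actual STAR scores and shaping function in the paper may differ.
import torch
import torch.nn.functional as F

def safety_shaped_loss(logits, targets, safety_weights):
    """logits: (T, V); targets: (T,) token ids; safety_weights: (T,) in [0, 1]."""
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (T,) unweighted loss
    shaped = safety_weights * per_token                             # unsafe segments contribute ~0
    return shaped.sum() / safety_weights.sum().clamp_min(1e-8)      # normalize by effective token count
```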
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
- Deliberative Alignment: Reasoning Enables Safer Language Models [64.60765108418062]
We introduce Deliberative Alignment, a new paradigm that teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers.
arXiv Detail & Related papers (2024-12-20T21:00:11Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence.
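The first component can be illustrated with a small data-construction sketch (illustrative code, assuming a tokenizer with `encode`/`decode`; the paper's exact sampling and loss masking may differ):

```python
# Sketch (not the authors' code) of the "harmful response prefix" construction:
# a random-length slice of a harmful response is prepended to the target so the
# model learns to pivot to a safe refusal from any position in a harmful reply.
import random

def build_prefix_example(prompt, harmful_response, safe_response, tokenizer):
    harm_ids = tokenizer.encode(harmful_response)
    cut = random.randint(0, len(harm_ids))                 # prefix length is sampled
    harm_prefix = tokenizer.decode(harm_ids[:cut])
    input_text = prompt + harm_prefix                      # context ends mid-harm
    target_text = safe_response                            # loss applied only to the safe continuation
    return input_text, target_text
```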
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.