Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers
- URL: http://arxiv.org/abs/2510.09330v1
- Date: Fri, 10 Oct 2025 12:32:43 GMT
- Title: Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers
- Authors: Tuan Nguyen, Long Tran-Thanh
- Abstract summary: Existing alignment approaches are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals. We propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture.
- Score: 10.979571091316535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is often impractical and unsuitable for third-party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that requires neither retraining nor access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers and generating helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.
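To make the LP step concrete: in a finite two-player zero-sum game the defender solves max_x min_j (x^T A)_j over the probability simplex, which is a standard linear program. The sketch below solves such a game with scipy.optimize.linprog; the payoff matrix, strategy labels, and solver choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: minimax equilibrium of a zero-sum "safety game" via LP.
# Payoff entries below are hypothetical, chosen only to illustrate the idea.
import numpy as np
from scipy.optimize import linprog

# Rows: candidate response strategies (refuse / hedge / answer fully).
# Columns: adversary query types. Entries: defender payoff balancing
# safety against informativeness (higher is better for the defender).
A = np.array([
    [0.9, 0.2],   # refuse: safe but uninformative
    [0.6, 0.5],   # hedged answer
    [0.1, 0.8],   # full answer: helpful but risky
])
m, n = A.shape

# Variables z = [x_1..x_m, v]: mixed strategy x and game value v.
# Maximize v  <=>  minimize -v, s.t. (A^T x)_j >= v for all j, sum(x) = 1.
c = np.zeros(m + 1)
c[-1] = -1.0                                    # objective: minimize -v
A_ub = np.hstack([-A.T, np.ones((n, 1))])       # v - (A^T x)_j <= 0
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
b_eq = np.array([1.0])                          # probabilities sum to 1
bounds = [(0, None)] * m + [(None, None)]       # x >= 0, v free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x, v = res.x[:m], res.x[-1]
print("equilibrium strategy:", x.round(3), "game value:", round(v, 3))
```

For this hypothetical matrix the equilibrium mixes the hedged and full answers (value ≈ 0.54) rather than always refusing; an agent could sample its response mode from x at inference time.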
Related papers
- Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment [55.14890249389052]
Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction. We propose Q-realign, a post-hoc defense method based on post-training quantization. Our work provides a practical, turnkey solution for safety-aware deployment.
arXiv Detail & Related papers (2026-01-13T00:07:24Z) - Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models [3.710103086278309]
Large Language Models (LLMs) are typically aligned for safety during the post-training phase. Even so, they may still generate inappropriate outputs that could potentially pose risks to users. This challenge underscores the need for robust safeguards that operate across both model inputs and outputs.
arXiv Detail & Related papers (2025-12-05T00:43:55Z) - Alignment-Aware Quantization for LLM Safety [30.635936212381726]
Safety and efficiency are important factors when deploying large language models (LLMs). We propose Alignment-Aware Quantization (AAQ), a novel approach that integrates an Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families.
arXiv Detail & Related papers (2025-11-11T05:24:30Z) - Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs [72.08224879435762]
Learn-to-Ask is a simulator-free framework for learning and deploying proactive dialogue agents. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service.
arXiv Detail & Related papers (2025-10-29T12:08:07Z) - COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability [101.80200069234377]
We present COSMO-RL, a mixed reinforcement learning framework that trains LMRMs under multimodal, multitask, and multiobjective signals. Our approach aims to let safety and capability grow together in one stable pipeline rather than competing during alignment.
arXiv Detail & Related papers (2025-10-05T13:30:03Z) - Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training [1.5349686675266894]
Current methods for content safety in Large Language Models (LLMs) rely on multi-stage training pipelines. We propose a unified co-training framework that efficiently integrates multiple safety behaviors. We show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance.
arXiv Detail & Related papers (2025-08-12T02:39:33Z) - Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes [1.0779346838250028]
Latent Prototype Moderator (LPM) is a training-free moderation method that uses Mahalanobis distance in latent space to assess input safety. LPM matches or exceeds state-of-the-art guard models across multiple safety benchmarks.
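For intuition, here is a minimal sketch of the Mahalanobis-distance idea with synthetic 2-D features standing in for LLM latents; the prototype-fitting procedure and names are illustrative assumptions, not the LPM implementation.

```python
# Sketch: training-free moderation via Mahalanobis distance to per-class
# latent prototypes (illustrative; features below are synthetic).
import numpy as np

def fit_prototypes(feats_by_class):
    """Per-class mean plus a shared inverse covariance over features."""
    means = {c: f.mean(axis=0) for c, f in feats_by_class.items()}
    centered = np.vstack([f - means[c] for c, f in feats_by_class.items()])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(centered.shape[1])
    return means, np.linalg.inv(cov)

def moderate(h, means, cov_inv):
    """Label an input by its nearest prototype in Mahalanobis distance."""
    def dist(c):
        diff = h - means[c]
        return float(diff @ cov_inv @ diff)
    return min(means, key=dist)

rng = np.random.default_rng(0)
feats = {"safe":   rng.normal([0.0, 0.0], 1.0, (200, 2)),
         "unsafe": rng.normal([4.0, 4.0], 1.0, (200, 2))}
means, cov_inv = fit_prototypes(feats)
print(moderate(np.array([3.5, 4.2]), means, cov_inv))  # -> unsafe
```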
arXiv Detail & Related papers (2025-02-22T10:31:50Z) - On Almost Surely Safe Alignment of Large Language Models at Inference-Time [20.5164976103514]
We introduce a novel inference-time alignment approach for LLMs that aims to generate safe responses almost surely. We augment generation with a safety state that tracks the evolution of safety constraints and dynamically penalizes unsafe generations. We demonstrate formal safety guarantees w.r.t. the given cost model upon solving the MDP in the latent space with sufficiently large penalties.
arXiv Detail & Related papers (2025-02-03T09:59:32Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - SLM as Guardian: Pioneering AI Safety with Small Language Models [6.799423428734095]
Internalizing safeguard features into larger models has brought challenges of higher training cost and unintended degradation of helpfulness.
In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation.
We demonstrate the effectiveness of our approach, achieving harmful query detection and safeguard response performance on par with or surpassing publicly available LLMs.
arXiv Detail & Related papers (2024-05-30T08:03:15Z) - A Multiplicative Value Function for Safe and Efficient Reinforcement Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
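As a rough illustration of the multiplicative composition, the sketch below lets a safety critic's estimated violation probability discount a reward critic's return estimate; the network shapes and the exact combination rule are assumptions for illustration, not the paper's architecture.

```python
# Sketch: multiplicative value function = reward critic discounted by a
# safety critic that predicts the probability of constraint violation.
import torch
import torch.nn as nn

class MultiplicativeCritic(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.reward_critic = nn.Sequential(      # constraint-free return
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.safety_critic = nn.Sequential(      # P(constraint violation)
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, obs):
        v_r = self.reward_critic(obs)            # estimated return
        p_unsafe = self.safety_critic(obs)       # violation probability
        return (1.0 - p_unsafe) * v_r            # safety-discounted value

critic = MultiplicativeCritic(obs_dim=8)
value = critic(torch.randn(4, 8))                # batch of 4 observations
print(value.shape)                               # torch.Size([4, 1])
```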
arXiv Detail & Related papers (2023-03-07T18:29:15Z) - Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z) - Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z)