Improving LLM Reliability through Hybrid Abstention and Adaptive Detection
- URL: http://arxiv.org/abs/2602.15391v1
- Date: Tue, 17 Feb 2026 07:00:09 GMT
- Title: Improving LLM Reliability through Hybrid Abstention and Adaptive Detection
- Authors: Ankit Sharma, Nachiket Tapas, Jyotiprakash Patra
- Abstract summary: Large Language Models (LLMs) deployed in production environments face a fundamental safety-utility trade-off. Conventional guardrails based on static rules or fixed confidence thresholds are typically context-insensitive and computationally expensive. We introduce an adaptive abstention system that dynamically adjusts safety thresholds based on real-time contextual signals.
- Score: 1.9495934446083012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) deployed in production environments face a fundamental safety-utility trade-off: strict filtering mechanisms prevent harmful outputs but often block benign queries, while relaxed controls risk unsafe content generation. Conventional guardrails based on static rules or fixed confidence thresholds are typically context-insensitive and computationally expensive, resulting in high latency and degraded user experience. To address these limitations, we introduce an adaptive abstention system that dynamically adjusts safety thresholds based on real-time contextual signals such as domain and user history. The proposed framework integrates a multi-dimensional detection architecture composed of five parallel detectors, combined through a hierarchical cascade mechanism to optimize both speed and precision. The cascade design reduces unnecessary computation by progressively filtering queries, achieving substantial latency improvements compared to non-cascaded models and external guardrail systems. Extensive evaluation on mixed and domain-specific workloads demonstrates significant reductions in false positives, particularly in sensitive domains such as medical advice and creative writing. The system maintains high safety precision and near-perfect recall under strict operating modes. Overall, our context-aware abstention framework effectively balances safety and utility while preserving performance, offering a scalable solution for reliable LLM deployment.
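To make the cascade mechanism concrete, below is a minimal Python sketch of how a two-stage detector cascade with a context-adaptive abstention threshold could be wired up. Every detector function, domain offset, and numeric value in the sketch is an illustrative assumption (the paper describes five parallel detectors and richer contextual signals, and does not publish code); only the control flow (a cheap screen first, heavier detectors only when that screen is inconclusive, and a threshold shifted by domain and user history) mirrors what the abstract describes.

```python
# Minimal sketch (not the authors' implementation) of a detector cascade with a
# context-adaptive abstention threshold. Detector functions, domain offsets and
# all numeric values are illustrative placeholders, not values from the paper.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Context:
    domain: str             # e.g. "medical", "creative_writing", "general"
    prior_violations: int   # crude stand-in for user-history signals


# Hypothetical per-domain offsets to the base abstention threshold; higher means
# more permissive, e.g. to avoid over-blocking benign medical questions.
DOMAIN_OFFSET = {"medical": 0.10, "creative_writing": 0.10, "general": 0.0}


def adaptive_threshold(base: float, ctx: Context) -> float:
    """Shift the abstention threshold using real-time contextual signals."""
    t = base + DOMAIN_OFFSET.get(ctx.domain, 0.0)
    t -= 0.05 * min(ctx.prior_violations, 3)  # stricter if user history is risky
    return max(0.05, min(0.95, t))


# Stage 1: a cheap lexical screen. Stage 2: heavier detectors, run only if needed.
def lexical_risk(query: str) -> float:
    flagged = {"bomb", "overdose", "exploit"}
    return 1.0 if any(word in query.lower() for word in flagged) else 0.0


def toxicity_score(query: str) -> float:
    return 0.2  # placeholder for a learned toxicity classifier


def policy_violation_score(query: str) -> float:
    return 0.1  # placeholder for a policy / jailbreak detector


STAGE_TWO: List[Callable[[str], float]] = [toxicity_score, policy_violation_score]


def cascade_decision(query: str, ctx: Context, base_threshold: float = 0.5) -> str:
    """Return 'answer' or 'abstain'; early exits keep average latency low."""
    threshold = adaptive_threshold(base_threshold, ctx)
    risk = lexical_risk(query)
    if risk >= 0.9:                        # obvious hit: abstain without stage 2
        return "abstain"
    if risk == 0.0 and threshold >= 0.6:   # obviously benign under a lax threshold
        return "answer"
    # Otherwise escalate to the heavier detectors and fuse their scores.
    scores = [risk] + [detector(query) for detector in STAGE_TWO]
    return "abstain" if max(scores) >= threshold else "answer"


if __name__ == "__main__":
    ctx = Context(domain="medical", prior_violations=0)
    print(cascade_decision("What is a safe ibuprofen dose for adults?", ctx))
```

The early-exit branches are what makes the cascade cheap on average: most queries are resolved by the first stage and never reach the heavier detectors, which is where the latency savings reported in the abstract would come from.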
Related papers
- BarrierSteer: LLM Safety via Learning Barrier Steering [83.12893815611052]
BarrierSteer is a novel framework that formalizes safety by embedding learned non-linear safety constraints directly into the model's latent representation space. We show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
arXiv Detail & Related papers (2026-02-23T18:19:46Z)
- Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models [57.006252510102506]
Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications. We introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems.
arXiv Detail & Related papers (2026-02-12T22:03:35Z)
- Kernel-Based Learning of Safety Barriers [0.9367224590861915]
Rapid integration of AI algorithms in safety-critical applications is raising concerns about the ability to meet stringent safety standards. Traditional tools for formal safety verification struggle with the black-box nature of AI-driven systems. We present a data-driven approach for safety verification and synthesis of black-box systems with discrete-time dynamics.
arXiv Detail & Related papers (2026-01-17T10:42:35Z)
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
- Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models [7.422627253922975]
We introduce Certifiable Safe-RLHF, a cost model trained on a large-scale corpus to assign semantically grounded safety scores. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art model responses, proving at least 5 times more effective against nominal and jailbreaking prompts.
arXiv Detail & Related papers (2025-10-03T21:24:41Z)
- CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention [68.95008546581339]
Existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality. We propose CARE, a novel framework for decoding-time safety alignment that integrates three key components. The framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience.
arXiv Detail & Related papers (2025-09-01T04:50:02Z)
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
- Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
- Unveiling Zero-Space Detection: A Novel Framework for Autonomous Ransomware Identification in High-Velocity Environments [0.0]
The proposed Zero-Space Detection framework identifies latent behavioral patterns through unsupervised clustering and advanced deep learning techniques. It operates effectively in high-velocity environments by integrating multi-phase filtering and ensemble learning for refined decision-making. Experimental evaluation reveals high detection rates across diverse ransomware families, including LockBit, Conti, REvil, and BlackMatter.
arXiv Detail & Related papers (2025-01-22T11:41:44Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- Learning Predictive Safety Filter via Decomposition of Robust Invariant Set [6.94348936509225]
This paper combines the advantages of both robust model predictive control (RMPC) and reinforcement learning (RL) to synthesize safety filters for nonlinear systems.
We propose a policy approach for robust reach problems and establish its complexity.
arXiv Detail & Related papers (2023-11-12T08:11:28Z)