ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
- URL: http://arxiv.org/abs/2601.04394v1
- Date: Wed, 07 Jan 2026 21:04:37 GMT
- Title: ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
- Authors: Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, Swagatam Das
- Abstract summary: We argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space. We propose ARREST, a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections.
- Score: 17.130698952440316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance across a wide range of tasks, yet they still lack a comparable cognitive mechanism for balancing factuality and safety. Drawing on this resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than constituting entirely separate alignment issues. We hypothesize that an external network, trained to recognize these fluctuations, can selectively intervene in the model to steer falsehood toward truthfulness and unsafe output toward safe output, without fine-tuning the model parameters themselves. Building on this hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, producing both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but, owing to its adversarial training, is also more versatile than RLHF-aligned models at generating soft refusals. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.
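To make the abstract's central idea concrete, here is a minimal sketch of an external "regulator" network that watches a frozen model's hidden activations and applies a gated correction via a forward hook, so the base parameters are never fine-tuned. All class names, shapes, and the gating scheme are illustrative assumptions, not ARREST's actual implementation.

```python
import torch
import torch.nn as nn

class Regulator(nn.Module):
    """External network: scores how 'drifted' a hidden state is and
    proposes a correction vector, applied only where needed."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.drift_score = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4), nn.GELU(),
            nn.Linear(hidden_dim // 4, 1), nn.Sigmoid(),
        )
        self.correction = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        gate = self.drift_score(h)            # in [0, 1]: drift estimate per token
        return h + gate * self.correction(h)  # selective intervention

# Attach the regulator to one layer of a frozen base model via a forward hook,
# leaving the base model's parameters untouched.
hidden_dim = 64
base_layer = nn.Linear(hidden_dim, hidden_dim)  # stand-in for an LLM block
for p in base_layer.parameters():
    p.requires_grad_(False)

regulator = Regulator(hidden_dim)
base_layer.register_forward_hook(lambda mod, inp, out: regulator(out))

h = torch.randn(2, 8, hidden_dim)               # (batch, seq, hidden)
steered = base_layer(h)                         # activations after regulation
print(steered.shape)
```

In this reading, only the regulator is trained (adversarially, per the abstract), which is what lets the same mechanism serve both factual correction and soft/hard refusals.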
Related papers
- Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought? [79.86483056611105]
Reasoning LLMs generate step-by-step chains of thought before giving an answer. How robust are these reasoning traces to disruptions that occur within them? We introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps.
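A minimal sketch of this kind of controlled CoT intervention: split a trace into steps, overwrite the step at a fixed timestep, and hand the disrupted prefix back to the model. The step format and the perturbation are illustrative assumptions, not the paper's protocol.

```python
def perturb_cot(cot_steps: list[str], timestep: int, perturbation: str) -> str:
    """Replace the reasoning step at a fixed timestep, then return the
    truncated trace so the model must continue from the disrupted point."""
    disrupted = cot_steps[:timestep] + [perturbation]
    return "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(disrupted))

trace = [
    "12 apples are split between 3 baskets.",
    "12 / 3 = 4 apples per basket.",
    "So each basket holds 4 apples.",
]
# Inject a contradictory step at timestep 1, then resume generation from it.
prompt_suffix = perturb_cot(trace, 1, "12 / 3 = 5 apples per basket.")
print(prompt_suffix)  # feed back to the model to test recovery behavior
```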
arXiv Detail & Related papers (2026-02-07T10:02:58Z)
- The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making [1.0671844383558033]
Large Language Models (LLMs) are widely documented to be sensitive to minor prompt perturbations and prone to sycophantic alignment with user biases. We quantify a robustness gap where LLMs demonstrate 110-300 times greater resistance to narrative manipulation than human subjects. We show that while LLMs may be "brittle" to how a query is formatted, they are remarkably "stable" against why a decision should be biased.
arXiv Detail & Related papers (2026-01-29T09:17:05Z)
- Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency [78.91846841708586]
We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. We propose Neighbor-Consistency Belief (NCB), a structural measure of belief that evaluates response coherence across a conceptual neighborhood. We also present Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%.
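One way to picture a neighborhood-consistency score in the spirit of NCB: probe a single fact through several related phrasings or contexts and measure how coherent the answers are. The majority-agreement scoring rule below is an assumption for illustration.

```python
from collections import Counter

def neighbor_consistency(answers: list[str]) -> float:
    """Fraction of neighborhood answers agreeing with the majority answer.
    1.0 = structurally stable belief; low values = brittle knowledge."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# Answers to neighboring probes of one fact (e.g. asked directly, in context,
# and via related entities); the data here is illustrative.
answers = ["Canberra", "Canberra", "Sydney", "Canberra"]
print(neighbor_consistency(answers))  # 0.75 -> belief not fully coherent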
arXiv Detail & Related papers (2026-01-09T16:23:21Z)
- The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs [9.470098715212087]
Enhancing truthfulness can negatively impact safety alignment. In this paper, we show that increasing factual accuracy often comes at the cost of weakened refusal behavior. We propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders.
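A minimal sparse-autoencoder sketch of that disentangling idea: learn an overcomplete, sparse feature basis for hidden activations so that refusal-related and hallucination-related features land on different dictionary atoms and can be edited independently. Sizes and the sparsity penalty are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 64, d_dict: int = 256):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.enc(h))  # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
h = torch.randn(32, 64)              # batch of hidden activations
recon, feats = sae(h)
loss = nn.functional.mse_loss(recon, h) + 1e-3 * feats.abs().mean()  # L1 sparsity
loss.backward()
# After training, features correlated with refusal vs. hallucination could be
# identified and ablated separately, sidestepping the trade-off above.
```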
arXiv Detail & Related papers (2025-10-09T04:30:58Z)
- AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z)
- Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts [55.70338710797578]
We introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. We introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals.
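For reference, the classic Rescorla-Wagner update the paper adapts: each co-present cue's associative strength V moves in proportion to the shared prediction error, so cues compete for credit. Mapping "cues" to relevant vs. inappropriate context segments is the paper's adaptation; the constants below are illustrative.

```python
def rw_update(V: dict[str, float], present: list[str],
              lam: float, alpha: float = 0.3, beta: float = 0.5) -> None:
    """One Rescorla-Wagner step: V_c += alpha*beta*(lam - sum of present V)."""
    error = lam - sum(V[c] for c in present)   # shared prediction error
    for c in present:
        V[c] += alpha * beta * error

V = {"relevant": 0.0, "inappropriate": 0.0}
for _ in range(20):                            # both cue types co-occur
    rw_update(V, ["relevant", "inappropriate"], lam=1.0)
print(V)  # the cues split the available associative strength between them
```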
arXiv Detail & Related papers (2025-09-02T00:40:34Z)
- Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). We provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
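A standard way to test for such an internal signal is a linear probe on intermediate-layer activations that predicts whether a factual recall was correct. The sketch below uses synthetic stand-in data; a real probe would use cached LLM activations and correctness labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128
direction = rng.normal(size=d)                 # hypothetical "correctness" axis
acts = rng.normal(size=(1000, d))              # stand-in layer activations
labels = (acts @ direction + rng.normal(scale=2.0, size=1000) > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(acts[:800], labels[:800])
print(f"probe accuracy: {probe.score(acts[800:], labels[800:]):.2f}")
# High held-out accuracy at intermediate layers would indicate the model
# internally encodes whether its recall is correct at generation time.
```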
arXiv Detail & Related papers (2025-05-27T16:24:02Z)
- Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment [30.605500809158986]
We propose a novel causal reward modeling approach that integrates causality to mitigate spurious correlations. Our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences.
arXiv Detail & Related papers (2025-01-16T16:00:37Z)
- Mission Impossible: A Statistical Perspective on Jailbreaking LLMs [6.627477206883248]
Large language models (LLMs) are trained on a deluge of text data with limited quality control.
Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour.
Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective.
arXiv Detail & Related papers (2024-08-02T17:55:50Z)
- A Theoretical Understanding of Self-Correction through In-context Alignment [51.622068973630796]
Large language models (LLMs) are capable of improving their abilities purely by self-correction.
We show that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way.
Inspired by these findings, we also illustrate applications of self-correction, such as defending against LLM jailbreaks.
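A minimal sketch of self-correction as in-context alignment, where the model's own self-examination acts as a reward that decides whether to keep refining. `generate` and `self_examine` are placeholders for actual LLM calls; the loop structure and threshold are illustrative assumptions.

```python
def generate(prompt: str) -> str:
    return "draft answer to: " + prompt          # placeholder LLM call

def self_examine(prompt: str, answer: str) -> float:
    return 0.9                                   # placeholder self-reward in [0, 1]

def self_correct(prompt: str, rounds: int = 3, threshold: float = 0.8) -> str:
    answer = generate(prompt)
    for _ in range(rounds):
        reward = self_examine(prompt, answer)    # model critiques its own answer
        if reward >= threshold:                  # accurate self-reward => stop
            break
        # fold the critique back into the context and try again
        answer = generate(
            f"{prompt}\nPrevious attempt (reward {reward:.2f}): {answer}\nImprove it."
        )
    return answer

print(self_correct("Why is the sky blue?"))
```

The paper's point is that this loop only helps insofar as the self-examination reward is accurate, which is also what makes it usable as a jailbreak defense.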
arXiv Detail & Related papers (2024-05-28T22:33:02Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
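A minimal sketch of inference-time activation guidance in the spirit of InferAligner: a harmfulness-related direction, estimated here as a difference of means over a guidance model's activations, is applied to the target model's hidden states at inference with no fine-tuning. The extraction recipe, strength, and data are illustrative assumptions.

```python
import torch

d = 64
# Difference-of-means direction from a guidance model's activations on
# harmful vs. harmless prompts (random stand-in data).
harmful_acts = torch.randn(100, d) + 1.0
harmless_acts = torch.randn(100, d)
steer_dir = harmful_acts.mean(0) - harmless_acts.mean(0)
steer_dir = steer_dir / steer_dir.norm()

def guide(hidden: torch.Tensor, strength: float = 4.0) -> torch.Tensor:
    """Push hidden states away from the harmfulness direction."""
    proj = hidden @ steer_dir                    # per-token projection
    return hidden - strength * proj.clamp(min=0).unsqueeze(-1) * steer_dir

h = torch.randn(1, 8, d)                         # target model hidden states
print(guide(h).shape)                            # same shape, steered content
```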