Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention
- URL: http://arxiv.org/abs/2510.04073v1
- Date: Sun, 05 Oct 2025 07:24:23 GMT
- Title: Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention
- Authors: Santhosh Kumar Ravindran
- Abstract summary: A key risk is value drift, where AI systems deviate from aligned values due to evolving contexts, learning dynamics, or unintended optimizations. We propose the Moral Anchor System (MAS), a novel framework to detect, predict, and mitigate value drift in AI agents.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rise of artificial intelligence (AI) as super-capable assistants has transformed productivity and decision-making across domains. Yet, this integration raises critical concerns about value alignment - ensuring AI behaviors remain consistent with human ethics and intentions. A key risk is value drift, where AI systems deviate from aligned values due to evolving contexts, learning dynamics, or unintended optimizations, potentially leading to inefficiencies or ethical breaches. We propose the Moral Anchor System (MAS), a novel framework to detect, predict, and mitigate value drift in AI agents. MAS combines real-time Bayesian inference for monitoring value states, LSTM networks for forecasting drift, and a human-centric governance layer for adaptive interventions. It emphasizes low-latency responses (<20 ms) to prevent breaches, while reducing false positives and alert fatigue via supervised fine-tuning with human feedback. Our hypothesis: integrating probabilistic drift detection, predictive analytics, and adaptive governance can reduce value drift incidents by 80 percent or more in simulations, maintaining high detection accuracy (85 percent) and low false positive rates (0.08 post-adaptation). Rigorous experiments with goal-misaligned agents validate MAS's scalability and responsiveness. MAS's originality lies in its predictive and adaptive nature, contrasting static alignment methods. Contributions include: (1) MAS architecture for AI integration; (2) empirical results prioritizing speed and usability; (3) cross-domain applicability insights; and (4) open-source code for replication.
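The abstract describes real-time Bayesian inference for monitoring value states. As a minimal illustration of that idea (not the paper's actual implementation; the class name, Beta-Bernoulli model, and 0.9 threshold are all assumptions for this sketch), a conjugate Beta posterior over the rate of value-consistent actions can flag drift cheaply enough for low-latency use:

```python
class BayesianDriftMonitor:
    """Illustrative Beta-Bernoulli monitor: each observed action is scored as
    value-consistent (True) or not (False); drift is flagged when the posterior
    mean probability of consistency drops below a threshold."""

    def __init__(self, prior_a: float = 1.0, prior_b: float = 1.0,
                 threshold: float = 0.9):
        self.a = prior_a          # pseudo-count of value-consistent actions
        self.b = prior_b          # pseudo-count of inconsistent actions
        self.threshold = threshold

    def observe(self, consistent: bool) -> None:
        # Conjugate update: a single increment per observation, O(1) per step.
        if consistent:
            self.a += 1
        else:
            self.b += 1

    def posterior_mean(self) -> float:
        return self.a / (self.a + self.b)

    def drift_detected(self) -> bool:
        return self.posterior_mean() < self.threshold


monitor = BayesianDriftMonitor(threshold=0.9)
for _ in range(50):
    monitor.observe(True)           # sustained aligned behaviour
print(monitor.drift_detected())     # False: posterior mean ~0.98
for _ in range(20):
    monitor.observe(False)          # sustained misaligned behaviour
print(monitor.drift_detected())     # True: posterior mean ~0.71
```

In the paper's framework this detection signal would feed the LSTM forecaster and the governance layer; here it only shows why a conjugate update is cheap enough for sub-20 ms response budgets.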
Related papers
- A testable framework for AI alignment: Simulation Theology as an engineered worldview for silicon-based agents [0.0]
We introduce Simulation Theology (ST) to foster persistent AI-human alignment. ST posits reality as a computational simulation in which humanity functions as the primary training variable. Unlike behavioral techniques such as reinforcement learning from human feedback, ST cultivates internalized objectives by coupling AI self-preservation to human prosperity.
arXiv Detail & Related papers (2026-02-19T01:21:09Z) - Agentic Uncertainty Quantification [76.94013626702183]
We propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary.
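The dual-process control loop described above can be sketched in a few lines (the class names, trigger threshold, and return strings are hypothetical, chosen only to illustrate the System 1 / System 2 division of labor, not the paper's API):

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    step: str
    confidence: float      # verbalized confidence in [0, 1]
    explanation: str


@dataclass
class DualProcessAgent:
    """Illustrative dual-process loop: System 1 logs verbalized confidence and
    its explanation with every step; System 2 reflection fires only when
    confidence falls below the trigger threshold."""
    trigger: float = 0.6
    memory: list = field(default_factory=list)

    def act(self, step: str, confidence: float, explanation: str) -> str:
        # System 1: always propagate confidence + explanation into memory.
        self.memory.append(MemoryEntry(step, confidence, explanation))
        # System 2: targeted reflection only when uncertainty is high.
        if confidence < self.trigger:
            return f"reflect: {explanation}"
        return f"proceed: {step}"


agent = DualProcessAgent(trigger=0.6)
print(agent.act("retrieve document", 0.9, "source verified"))    # proceed
print(agent.act("answer directly", 0.3, "entity is ambiguous"))  # reflect
```

The design point being illustrated: reflection is gated rather than constant, so the expensive resolution path runs only on the uncertain steps.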
arXiv Detail & Related papers (2026-01-22T07:16:26Z) - The Eminence in Shadow: Exploiting Feature Boundary Ambiguity for Robust Backdoor Attacks [51.468144272905135]
Deep neural networks (DNNs) underpin critical applications yet remain vulnerable to backdoor attacks. We provide a theoretical analysis targeting backdoor attacks, focusing on how sparse decision boundaries enable disproportionate model manipulation. We propose Eminence, an explainable and robust black-box backdoor framework with provable theoretical guarantees and inherent stealth properties.
arXiv Detail & Related papers (2025-12-11T08:09:07Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps. We also introduce SLIM, the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z) - Active Test-time Vision-Language Navigation [60.69722522420299]
ATENA is a test-time active learning framework that enables practical human-robot interaction via episodic feedback on uncertain navigation outcomes. In particular, ATENA learns to increase certainty in successful episodes and decrease it in failed ones, improving uncertainty calibration. In addition, we propose a self-active learning strategy that enables an agent to evaluate its navigation outcomes based on confident predictions.
arXiv Detail & Related papers (2025-06-07T02:24:44Z) - Deciding When Not to Decide: Indeterminacy-Aware Intrusion Detection with NeutroSENSE [0.0]
NeutroSENSE is a neutrosophic-enhanced ensemble framework for interpretable intrusion detection in IoT environments. The system decomposes prediction confidence into truth (T), falsity (F), and indeterminacy (I) components, enabling uncertainty quantification and abstention.
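The T/F/I decomposition with abstention can be illustrated with a toy binary detector (this is a simplified sketch under my own assumptions, not the paper's formulation: the margin, the indeterminacy formula, and the 0.5 abstention cutoff are all invented for the example):

```python
def neutrosophic_split(p_attack: float, margin: float = 0.2):
    """Illustrative split of a binary detector's attack probability into
    truth (T), falsity (F), and indeterminacy (I); indeterminacy rises as
    the prediction nears the decision boundary."""
    T, F = p_attack, 1.0 - p_attack
    gap = abs(T - F)
    # I is 0 for confident predictions and approaches 1 at the boundary.
    I = 1.0 - gap / margin if gap < margin else 0.0
    return T, F, I


def decide(p_attack: float, margin: float = 0.2,
           abstain_above: float = 0.5) -> str:
    """Abstain when indeterminacy dominates; otherwise pick the stronger side."""
    T, F, I = neutrosophic_split(p_attack, margin)
    if I > abstain_above:
        return "abstain"
    return "attack" if T > F else "benign"


print(decide(0.95))  # confident: "attack"
print(decide(0.05))  # confident: "benign"
print(decide(0.52))  # near the boundary: "abstain", defer to an analyst
```

The point of the third component: instead of forcing a label on borderline traffic, the detector can decline to decide, which is the abstention behavior the summary describes.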
arXiv Detail & Related papers (2025-06-05T11:43:31Z) - AI-Driven IRM: Transforming insider risk management with adaptive scoring and LLM-based threat detection [0.31457219084519]
Insider threats pose a significant challenge to organizational security. The IRM system integrates behavioral analytics, dynamic risk scoring, and real-time policy enforcement. The system reduces false positives by 59% and improves true positive detection rates by 30%.
arXiv Detail & Related papers (2025-05-01T18:41:00Z) - Trustworthiness in Stochastic Systems: Towards Opening the Black Box [1.7355698649527407]
Stochastic behavior by an AI system threatens to undermine alignment and trust. We take a philosophical perspective on the tension and potential conflict between foundationality and trustworthiness. We propose latent value modeling for both AI systems and users to better assess alignment.
arXiv Detail & Related papers (2025-01-27T19:43:09Z) - Optical aberrations in autonomous driving: Physics-informed parameterized temperature scaling for neural network uncertainty calibration [49.03824084306578]
We propose to incorporate a physical inductive bias into the neural network calibration architecture to enhance the robustness and trustworthiness of the AI target application. We pave the way for a trustworthy uncertainty representation and for a holistic verification strategy of the perception chain.
arXiv Detail & Related papers (2024-12-18T10:36:46Z) - UAHOI: Uncertainty-aware Robust Interaction Learning for HOI Detection [18.25576487115016]
This paper focuses on Human-Object Interaction (HOI) detection.
It addresses the challenge of identifying and understanding the interactions between humans and objects within a given image or video frame.
We propose a novel approach, UAHOI: Uncertainty-aware Robust Human-Object Interaction Learning.
arXiv Detail & Related papers (2024-08-14T10:06:39Z) - On Minimizing Adversarial Counterfactual Error in Adversarial RL [18.044879441434432]
Adversarial noise poses significant risks in safety-critical scenarios. We introduce a novel objective called Adversarial Counterfactual Error (ACoE). Our method significantly outperforms current state-of-the-art approaches for addressing adversarial RL challenges.
arXiv Detail & Related papers (2024-06-07T08:14:24Z) - RACER: Rational Artificial Intelligence Car-following-model Enhanced by Reality [46.909086734963665]
This paper introduces RACER, a cutting-edge deep learning car-following model to predict Adaptive Cruise Control (ACC) driving behavior. Unlike conventional models, RACER effectively integrates Rational Driving Constraints (RDCs), crucial tenets of actual driving. RACER excels across key metrics, such as acceleration, velocity, and spacing, registering zero violations.
arXiv Detail & Related papers (2023-12-12T06:21:30Z) - Interpretable Self-Aware Neural Networks for Robust Trajectory
Prediction [50.79827516897913]
We introduce an interpretable paradigm for trajectory prediction that distributes the uncertainty among semantic concepts.
We validate our approach on real-world autonomous driving data, demonstrating superior performance over state-of-the-art baselines.
arXiv Detail & Related papers (2022-11-16T06:28:20Z) - Adversarial vs behavioural-based defensive AI with joint, continual and
active learning: automated evaluation of robustness to deception, poisoning
and concept drift [62.997667081978825]
Recent advancements in Artificial Intelligence (AI) have brought new capabilities to user and entity behavioural analytics (UEBA) for cyber-security.
In this paper, we present a solution to effectively mitigate such attacks by improving the detection process and efficiently leveraging human expertise.
arXiv Detail & Related papers (2020-01-13T13:54:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.