Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model
- URL: http://arxiv.org/abs/2508.09971v1
- Date: Wed, 13 Aug 2025 17:39:09 GMT
- Title: Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model
- Authors: Zihan Wang, Nina Mahmoudian
- Abstract summary: Vision-driven autonomous river following by Unmanned Aerial Vehicles is critical for applications such as rescue, surveillance, and environmental monitoring. We formalize river following as a coverage control problem in which the reward function is submodular, yielding diminishing returns as more unique river segments are visited. We present the Constrained Actor Dynamics Estimator (CADE) architecture, which integrates the actor, the cost estimator, and the Semantic Dynamics Model (SDM) for cost advantage estimation to form a model-based SafeRL framework.
- Score: 11.29011178752037
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-driven autonomous river following by Unmanned Aerial Vehicles (UAVs) is critical for applications such as rescue, surveillance, and environmental monitoring, particularly in dense riverine environments where GPS signals are unreliable. We formalize river following as a coverage control problem in which the reward function is submodular, yielding diminishing returns as more unique river segments are visited, thereby framing the task as a Submodular Markov Decision Process. First, we introduce Marginal Gain Advantage Estimation (MGAE), which refines the reward advantage function by using a sliding window baseline computed from historical episodic returns, thus aligning the advantage estimation with the agent's evolving recognition of action value in non-Markovian settings. Second, we develop a Semantic Dynamics Model (SDM) based on patchified water semantic masks that provides more interpretable and data-efficient short-term prediction of future observations compared to latent vision dynamics models. Third, we present the Constrained Actor Dynamics Estimator (CADE) architecture, which integrates the actor, the cost estimator, and the SDM for cost advantage estimation to form a model-based SafeRL framework capable of solving partially observable Constrained Submodular Markov Decision Processes. Simulation results demonstrate that MGAE achieves faster convergence and superior performance over traditional critic-based methods such as Generalized Advantage Estimation (GAE). The SDM provides more accurate short-term state predictions that enable the cost estimator to better predict potential violations. Overall, CADE effectively integrates safety regulation into model-based RL: the Lagrangian approach achieves a soft balance of reward and safety during training, while the safety layer enhances performance during inference by hard action overlay.
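The two core ideas in the abstract, a submodular coverage reward with diminishing returns and an MGAE-style sliding-window baseline over historical episodic returns, can be sketched in a few lines. This is a toy illustration under assumed details (grid-cell coverage, window size, function names), not the authors' implementation.

```python
from collections import deque

def coverage_reward(visited, cell):
    """Submodular coverage reward: +1 only for a river cell not yet
    visited, so returns diminish as coverage grows (illustrative)."""
    if cell in visited:
        return 0.0
    visited.add(cell)
    return 1.0

class SlidingWindowBaseline:
    """MGAE-style advantage sketch: episodic return minus the mean of
    the last `window` episodic returns (assumed form, not the paper's code)."""
    def __init__(self, window=20):
        self.returns = deque(maxlen=window)

    def advantage(self, episode_return):
        baseline = sum(self.returns) / len(self.returns) if self.returns else 0.0
        self.returns.append(episode_return)
        return episode_return - baseline

# Toy usage: revisiting a cell yields no marginal gain.
visited = set()
ep_return = sum(coverage_reward(visited, c) for c in [(0, 0), (0, 1), (0, 0)])
est = SlidingWindowBaseline(window=5)
adv = est.advantage(ep_return)  # first episode: baseline is 0
```

Because the baseline is a moving average of past episodic returns rather than a learned critic, the advantage tracks the agent's own improving performance, which is the point of MGAE in the non-Markovian coverage setting.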
Related papers
- From Scalar Rewards to Potential Trends: Shaping Potential Landscapes for Model-Based Reinforcement Learning [22.59885243102632]
Shaping Landscapes with Optimistic Potential Estimates (SLOPE) is a novel framework that shifts reward modeling from predicting scalars to constructing informative potential landscapes. SLOPE employs optimistic distributional regression to estimate high-confidence upper bounds, which amplifies rare success signals and ensures sufficient exploration gradients. Evaluations on 30+ tasks across 5 benchmarks demonstrate that SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense reward settings.
arXiv Detail & Related papers (2026-02-03T07:13:26Z) - Uncertainty Quantification for Large Language Model Reward Learning under Heterogeneous Human Feedback [8.538830579425147]
We study the estimation and statistical properties of reward models used in aligning large language models (LLMs). A key component of LLM alignment is reinforcement learning from human feedback.
arXiv Detail & Related papers (2025-12-02T20:22:25Z) - Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs [3.198812241868092]
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails.
arXiv Detail & Related papers (2025-11-26T04:36:34Z) - Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking [78.69179041551014]
We propose an information-theoretic reward modeling framework based on the Information Bottleneck principle. We show that InfoRM filters out preference-irrelevant information to alleviate reward misgeneralization. We also introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape.
arXiv Detail & Related papers (2025-10-15T15:51:59Z) - Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z) - Latent Variable Modeling in Multi-Agent Reinforcement Learning via Expectation-Maximization for UAV-Based Wildlife Protection [0.0]
This paper introduces a novel Expectation-Maximization (EM) based latent variable modeling approach in the context of wildlife protection. By modeling hidden environmental factors and inter-agent dynamics through latent variables, our method enhances exploration and coordination under uncertainty. We implement and evaluate our EM-MARL framework using a custom simulation involving 10 UAVs tasked with patrolling protected habitats of the endangered Iranian leopard.
arXiv Detail & Related papers (2025-08-26T06:57:33Z) - Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics [34.570579623171476]
"First Reasoning, Then Forecasting" is a strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. We introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning scheme. Our approach significantly enhances trajectory prediction confidence, achieving highly competitive performance relative to state-of-the-art methods.
arXiv Detail & Related papers (2025-07-16T09:46:17Z) - Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment [51.10604883057508]
We propose DR-IRL (Dynamically adjusting Rewards through Inverse Reinforcement Learning). We first train category-specific reward models using a balanced safety dataset covering seven harmful categories via IRL. Then we enhance Group Relative Policy Optimization (GRPO) by scaling rewards according to task difficulty: data-level hardness measured by text-encoder cosine similarity, and model-level responsiveness measured by reward gaps.
arXiv Detail & Related papers (2025-03-23T16:40:29Z) - Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacks on various downstream models fine-tuned from the Segment Anything Model (SAM). To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Probabilistic Traffic Forecasting with Dynamic Regression [15.31488551912888]
This paper proposes a dynamic regression (DR) framework that enhances existing deep spatio-temporal models by incorporating a component for learning the error process in traffic forecasting. The framework relaxes the assumption of time independence by modeling the error series of the base model using a matrix-structured autoregressive (AR) model. The newly designed loss function is based on the likelihood of a non-isotropic error term, enabling the model to generate probabilistic forecasts while preserving the original outputs of the base model.
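The dynamic-regression idea, keeping the base forecaster frozen and modeling its residuals autoregressively, can be illustrated with a scalar AR(1) correction. The paper uses a matrix-structured AR model; this one-dimensional sketch with assumed function names only shows the mechanism.

```python
import numpy as np

def ar1_error_correction(base_preds, actuals, horizon=1):
    """Dynamic-regression sketch: fit an AR(1) model to the residuals
    of a frozen base forecaster, then produce a correction term for
    the next prediction (scalar stand-in for the matrix-structured AR)."""
    resid = actuals - base_preds
    # Least-squares AR(1) coefficient: resid[t] ~ phi * resid[t-1]
    phi = np.dot(resid[1:], resid[:-1]) / np.dot(resid[:-1], resid[:-1])
    return phi ** horizon * resid[-1]

# Toy usage: a base model with a persistent +0.5 bias is corrected.
base = np.array([1.0, 2.0, 3.0, 4.0])
true = np.array([1.5, 2.5, 3.5, 4.5])
corr = ar1_error_correction(base, true)  # next forecast: base + corr
```

Because the correction is additive on top of the base model's output, the base model's point forecasts are preserved exactly when the residuals carry no autocorrelation, which mirrors the framework's design goal.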
arXiv Detail & Related papers (2023-01-17T01:12:44Z) - Benchmarking Safe Deep Reinforcement Learning in Aquatic Navigation [78.17108227614928]
We propose a benchmark environment for Safe Reinforcement Learning focusing on aquatic navigation.
We consider value-based and policy-gradient Deep Reinforcement Learning (DRL) approaches.
We also propose a verification strategy that checks the behavior of the trained models over a set of desired properties.
arXiv Detail & Related papers (2021-12-16T16:53:56Z) - Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction [37.06232589005015]
The value function is a central notion in Reinforcement Learning (RL).
We propose Value Decomposition with Future Prediction (VDFP).
We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation.
arXiv Detail & Related papers (2021-03-03T07:28:56Z)
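The VDFP decomposition described above, a latent future-dynamics part composed with a policy-independent return part, can be sketched with linear toy models. All weights and function names here are illustrative assumptions chosen to show the compositional structure, not the paper's architecture.

```python
import numpy as np

def latent_future(state, w_dyn):
    """Latent dynamics part: predicts a compressed representation of
    the future trajectory from the current state (linear toy model)."""
    return w_dyn @ state

def trajectory_return(latent, w_ret):
    """Policy-independent return part: maps the predicted latent
    future to a scalar return estimate."""
    return float(w_ret @ latent)

def value(state, w_dyn, w_ret):
    """VDFP-style value estimate: compose the two parts, so dynamics
    and return modeling can be trained separately (illustrative)."""
    return trajectory_return(latent_future(state, w_dyn), w_ret)

# Toy usage with identity dynamics and a fixed return head.
state = np.ones(3)
w_dyn = np.eye(3)
w_ret = np.array([1.0, 2.0, 3.0])
v = value(state, w_dyn, w_ret)
```

The practical benefit of this split is that the dynamics part can be learned from trajectories regardless of which policy generated them, while only the return part depends on the reward structure.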
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.