An Index Policy Based on Sarsa and Q-learning for Heterogeneous Smart
Target Tracking
- URL: http://arxiv.org/abs/2402.12015v1
- Date: Mon, 19 Feb 2024 10:13:25 GMT
- Title: An Index Policy Based on Sarsa and Q-learning for Heterogeneous Smart
Target Tracking
- Authors: Yuhang Hao and Zengfu Wang and Jing Fu and Quan Pan
- Abstract summary: We propose a new policy, namely ISQ, to maximize the long-term tracking rewards.
Numerical results demonstrate that the proposed ISQ policy outperforms conventional Q-learning-based methods.
- Score: 13.814608044569967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In solving the non-myopic radar scheduling for multiple smart target tracking
within an active and passive radar network, we need to consider both short-term
enhanced tracking performance and a higher probability of target maneuvering in
the future with active tracking. Acquiring the long-term tracking performance
while scheduling the beam resources of active and passive radars poses a
challenge. To address this challenge, we model this problem as a Markov
decision process consisting of parallel restless bandit processes. Each bandit
process is associated with a smart target whose estimation state evolves
according to a different discrete dynamic model for each action, namely whether
or not the target is being tracked. The discrete state is defined by
the dynamic mode. The problem exhibits the curse of dimensionality, where
optimal solutions are in general intractable. We resort to heuristics based on
well-known restless multi-armed bandit techniques, which yield efficient
scheduling policies driven by indices: real numbers representing the marginal
reward of taking each action. For the practical case where the transition
matrices are unknown, we propose a new method that utilizes the
forward Sarsa and backward Q-learning to approximate the indices by adapting
the state-action value functions, or equivalently the Q-functions, and
propose a new policy, namely ISQ, aiming to maximize the long-term tracking
rewards. Numerical results demonstrate that the proposed ISQ policy outperforms
conventional Q-learning-based methods and rapidly converges to the well-known
Whittle index policy computed with known state transition models, which serves
as the benchmark.
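As a rough illustration of the abstract's idea, and not the paper's actual algorithm or radar model, the sketch below keeps a tabular Q-function per target, updates it with standard Sarsa (on-policy) and Q-learning (off-policy) rules, and defines a state's index as the Q-value gap between active and passive tracking. The class, state labels, hyperparameter values, and the top-`budget` scheduling rule are all illustrative assumptions.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)  # 0 = passive tracking, 1 = active tracking (assumed encoding)

class IndexLearner:
    """Tabular sketch of index learning for one bandit process (one target)."""

    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)  # (state, action) -> Q-value, zero-initialized
        self.alpha = alpha           # learning rate
        self.gamma = gamma           # discount factor
        self.epsilon = epsilon       # exploration rate

    def index(self, state):
        # Index = marginal reward of active over passive tracking,
        # approximated by the difference of the two Q-values.
        return self.q[(state, 1)] - self.q[(state, 0)]

    def choose(self, state):
        # Epsilon-greedy action selection for a single target.
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def sarsa_update(self, s, a, r, s_next, a_next):
        # Sarsa: on-policy target uses the action actually taken next.
        target = r + self.gamma * self.q[(s_next, a_next)]
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

    def q_update(self, s, a, r, s_next):
        # Q-learning: off-policy target uses the greedy next action.
        target = r + self.gamma * max(self.q[(s_next, b)] for b in ACTIONS)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

def schedule(learners, states, budget):
    # Index policy: rank targets by their current indices and
    # allocate the active-radar budget to the top-ranked ones.
    ranked = sorted(range(len(learners)),
                    key=lambda i: learners[i].index(states[i]),
                    reverse=True)
    return set(ranked[:budget])
```

The point of the index formulation is the last function: instead of solving the joint scheduling problem over all targets (which suffers from the curse of dimensionality), each target's index is learned independently and the scheduler just ranks the indices at each decision epoch.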
Related papers
- Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm.
QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling.
It consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z)
- Enhancing Q-Value Updates in Deep Q-Learning via Successor-State Prediction [3.2883573376133555]
Deep Q-Networks (DQNs) estimate future returns by learning from transitions sampled from a replay buffer.
SADQ integrates successor-state distributions into the Q-value estimation process.
We provide theoretical guarantees that SADQ maintains unbiased value estimates while reducing training variance.
arXiv Detail & Related papers (2025-11-05T20:04:53Z)
- A Hybrid Approach for Visual Multi-Object Tracking [3.259045978275386]
This paper proposes a visual multi-object tracking method to ensure consistency for unknown and time-varying target numbers under nonlinear dynamics.
A particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes.
A novel scheme is proposed for the smooth updating of target states while preserving their identities.
arXiv Detail & Related papers (2025-10-28T13:22:24Z)
- ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving [64.42138266293202]
ResAD is a Normalized Residual Trajectory Modeling framework.
It reframes the learning task to predict the residual deviation from an inertial reference.
On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy.
arXiv Detail & Related papers (2025-10-09T17:59:36Z)
- Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models [71.34520793462069]
Unsupervised reinforcement learning (RL) aims at pre-training agents that can solve a wide range of downstream tasks in complex environments.
We introduce a novel algorithm regularizing unsupervised RL towards imitating trajectories from unlabeled behavior datasets.
We demonstrate the effectiveness of this new approach in a challenging humanoid control problem.
arXiv Detail & Related papers (2025-04-15T10:41:11Z)
- Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction [7.918703013303246]
We present Latent Macro Action Planner (L-MAP), which addresses the challenge of learning to make decisions in high-dimensional continuous action spaces.
L-MAP learns a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE).
In offline RL settings, including continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns.
arXiv Detail & Related papers (2025-02-28T16:02:23Z)
- POMDP-Driven Cognitive Massive MIMO Radar: Joint Target Detection-Tracking In Unknown Disturbances [42.99053410696693]
This work explores the application of a Partially Observable Markov Decision Process framework to enhance the tracking and detection tasks.
The proposed approach employs an online algorithm that does not require any a priori knowledge of the noise statistics.
arXiv Detail & Related papers (2024-10-23T15:34:11Z)
- Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives [16.101435842520473]
This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem.
Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP.
We present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space.
arXiv Detail & Related papers (2024-06-05T02:33:50Z)
- Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies [64.2210390071609]
We present a novel Heavy-Tailed Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems.
We show consistent performance improvement across all tasks in terms of high average cumulative reward.
arXiv Detail & Related papers (2022-06-12T04:09:39Z)
- Learning Robust Policies for Generalized Debris Capture with an Automated Tether-Net System [2.0429716172112617]
This paper presents a reinforcement learning framework that integrates a policy optimization approach with net dynamics simulations.
A state transition model is considered in order to incorporate synthetic uncertainties in state estimation and launch actuation.
The trained policy demonstrates capture performance close to that obtained with reliability-based optimization run over an individual scenario.
arXiv Detail & Related papers (2022-01-11T20:09:05Z)
- Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
- SOAC: The Soft Option Actor-Critic Architecture [25.198302636265286]
Methods have been proposed for concurrently learning low-level intra-option policies and high-level option selection policy.
Existing methods typically suffer from two major challenges: ineffective exploration and unstable updates.
We present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges.
arXiv Detail & Related papers (2020-06-25T13:06:59Z)
- Learning to Track Dynamic Targets in Partially Known Environments [48.49957897251128]
We use a deep reinforcement learning approach to solve active target tracking.
In particular, we introduce Active Tracking Target Network (ATTN), a unified RL policy that is capable of solving major sub-tasks of active target tracking.
arXiv Detail & Related papers (2020-06-17T22:45:24Z)
- Online Reinforcement Learning Control by Direct Heuristic Dynamic Programming: from Time-Driven to Event-Driven [80.94390916562179]
Time-driven learning refers to the machine learning method that updates parameters in a prediction model continuously as new data arrives.
It is desirable to prevent the time-driven dHDP from updating due to insignificant system events such as noise.
We show how the event-driven dHDP algorithm works in comparison to the original time-driven dHDP.
arXiv Detail & Related papers (2020-06-16T05:51:25Z)
- Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z)
- Evolutionary Stochastic Policy Distillation [139.54121001226451]
We propose a new method called Evolutionary Stochastic Policy Distillation (ESPD) to solve GCRS tasks.
ESPD enables a target policy to learn from a series of its variants through the technique of policy distillation (PD).
The experiments based on the MuJoCo control suite show the high learning efficiency of the proposed method.
arXiv Detail & Related papers (2020-04-27T16:19:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.