Optimizing the Long-Term Average Reward for Continuing MDPs: A Technical
Report
- URL: http://arxiv.org/abs/2104.06139v2
- Date: Wed, 14 Apr 2021 10:32:18 GMT
- Title: Optimizing the Long-Term Average Reward for Continuing MDPs: A Technical
Report
- Authors: Chao Xu, Yiping Xie, Xijun Wang, Howard H. Yang, Dusit Niyato, Tony Q.
S. Quek
- Abstract summary: We strike a balance between the information freshness experienced by users and the energy consumed by sensors.
We cast the corresponding status update procedure as a continuing Markov Decision Process (MDP).
To circumvent the curse of dimensionality, we establish a methodology for designing deep reinforcement learning (DRL) algorithms.
- Score: 117.23323653198297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, we have struck a balance between the information freshness,
in terms of age of information (AoI), experienced by users and the energy
consumed by sensors, by appropriately activating sensors to update their current
status in caching-enabled Internet of Things (IoT) networks [1]. To solve this problem,
we cast the corresponding status update procedure as a continuing Markov
Decision Process (MDP) (i.e., without termination states), where the number of
state-action pairs increases exponentially with respect to the number of
considered sensors and users. Moreover, to circumvent the curse of
dimensionality, we have established a methodology for designing deep
reinforcement learning (DRL) algorithms to maximize (resp. minimize) the
average reward (resp. cost), by integrating R-learning, a tabular reinforcement
learning (RL) algorithm tailored for maximizing the long-term average reward,
and traditional DRL algorithms, initially developed to optimize the discounted
long-term cumulative reward rather than the average one. In this technical
report, we present a detailed discussion of the technical contributions of
this methodology.
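The core building block named above, R-learning, replaces the discount factor of standard Q-learning with a running estimate of the average reward. Below is a minimal sketch of tabular R-learning (Schwartz's classic formulation, not the paper's DRL integration); the function names, step counts, and the toy environment interface `env_step(s, a) -> (reward, next_state)` are illustrative assumptions, not from the paper.

```python
import numpy as np

def r_learning(env_step, n_states, n_actions, alpha=0.1, beta=0.01,
               epsilon=0.1, steps=10_000, seed=0):
    """Tabular R-learning: learns relative action values Q and an
    estimate rho of the long-term average reward (no discounting)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    rho = 0.0  # running estimate of the average reward
    s = 0
    for _ in range(steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        r, s_next = env_step(s, a)
        # average-reward TD error: rho takes the place of the discount factor
        delta = r - rho + Q[s_next].max() - Q[s, a]
        Q[s, a] += alpha * delta
        # update rho only when the chosen action is (still) greedy
        if Q[s, a] == Q[s].max():
            rho += beta * delta
        s = s_next
    return Q, rho
```

On a continuing (termination-free) MDP, `rho` converges toward the optimal average reward, which is exactly the objective the report's DRL methodology targets at scale.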
Related papers
- Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes [7.028778922533688]
Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty.
We study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning.
arXiv Detail & Related papers (2024-10-14T14:52:23Z)
- Age-Based Scheduling for Mobile Edge Computing: A Deep Reinforcement Learning Approach [58.911515417156174]
We propose a new definition of Age of Information (AoI) and, based on the redefined AoI, we formulate an online AoI problem for MEC systems.
We introduce Post-Decision States (PDSs) to exploit the partial knowledge of the system's dynamics.
We also combine PDSs with deep RL to further improve the algorithm's applicability, scalability, and robustness.
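For context on the AoI metric that the paper above redefines, the textbook slotted-time AoI recursion is sketched here; this is the standard dynamics, not necessarily the paper's redefined version, and the function name and delay parameter are illustrative assumptions.

```python
def aoi_step(age, update_received, delivery_delay=1):
    """Standard Age-of-Information recursion for a slotted system:
    the age resets to the delivered sample's delay on a successful
    status update, and otherwise grows by one slot."""
    if update_received:
        return delivery_delay  # age of the freshly delivered sample
    return age + 1
```

Minimizing the long-run average of this quantity is the freshness side of the freshness-energy trade-off studied in the main paper.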
arXiv Detail & Related papers (2023-12-01T01:30:49Z)
- Optimal Scheduling in IoT-Driven Smart Isolated Microgrids Based on Deep Reinforcement Learning [10.924928763380624]
We investigate the scheduling of diesel generators (DGs) in an Internet of Things-driven microgrid (MG) via deep reinforcement learning (DRL).
The DRL agent learns an optimal policy from historical renewable-generation and load data of previous days.
The goal is to reduce operating costs while ensuring supply-demand balance.
arXiv Detail & Related papers (2023-04-28T23:52:50Z)
- Deep Reinforcement Learning Based Multidimensional Resource Management for Energy Harvesting Cognitive NOMA Communications [64.1076645382049]
Combination of energy harvesting (EH), cognitive radio (CR), and non-orthogonal multiple access (NOMA) is a promising solution to improve energy efficiency.
In this paper, we study the spectrum, energy, and time resource management for deterministic-CR-NOMA IoT systems.
arXiv Detail & Related papers (2021-09-17T08:55:48Z)
- Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Offline RL [56.20835219296896]
We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility.
We develop a new batch RL algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced distribution shifts across sessions.
arXiv Detail & Related papers (2021-06-01T15:58:05Z)
- Reinforcement Learning for Datacenter Congestion Control [50.225885814524304]
Successful congestion control algorithms can dramatically improve latency and overall network throughput.
To date, no learning-based algorithm has shown practical potential in this domain.
We devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks.
We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training.
arXiv Detail & Related papers (2021-02-18T13:49:28Z)
- Proximal Deterministic Policy Gradient [20.951797549505986]
We introduce two techniques to improve off-policy Reinforcement Learning (RL) algorithms.
We exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate.
We demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.
arXiv Detail & Related papers (2020-08-03T10:19:59Z)
- Learning Centric Power Allocation for Edge Intelligence [84.16832516799289]
Edge intelligence, which collects distributed data and performs machine learning at the network edge, has been proposed.
This paper proposes a learning centric power allocation (LCPA) method, which allocates radio resources based on an empirical classification error model.
Experimental results show that the proposed LCPA algorithm significantly outperforms other power allocation algorithms.
arXiv Detail & Related papers (2020-07-21T07:02:07Z)
- Stacked Auto Encoder Based Deep Reinforcement Learning for Online Resource Scheduling in Large-Scale MEC Networks [44.40722828581203]
An online resource scheduling framework is proposed for minimizing the sum of weighted task latency for all the Internet of things (IoT) users.
A deep reinforcement learning (DRL) based solution is proposed, which includes the following components.
A preserved and prioritized experience replay (2p-ER) is introduced to assist the DRL to train the policy network and find the optimal offloading policy.
arXiv Detail & Related papers (2020-01-24T23:01:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.