Adaptive Action Duration with Contextual Bandits for Deep Reinforcement Learning in Dynamic Environments
- URL: http://arxiv.org/abs/2507.00030v1
- Date: Tue, 17 Jun 2025 20:04:53 GMT
- Title: Adaptive Action Duration with Contextual Bandits for Deep Reinforcement Learning in Dynamic Environments
- Authors: Abhishek Verma, Nallarasan V, Balaraman Ravindran
- Abstract summary: We propose a novel paradigm that integrates contextual bandits with Deep Reinforcement Learning (DRL). Our approach augments a Deep Q-Network (DQN) with a contextual bandit module that learns to choose optimal action repetition rates based on state contexts. Experiments on Atari 2600 games demonstrate significant performance improvements over static duration baselines.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Reinforcement Learning (DRL) has achieved remarkable success in complex sequential decision-making tasks, such as playing Atari 2600 games and mastering board games. A critical yet underexplored aspect of DRL is the temporal scale of action execution. We propose a novel paradigm that integrates contextual bandits with DRL to adaptively select action durations, enhancing policy flexibility and computational efficiency. Our approach augments a Deep Q-Network (DQN) with a contextual bandit module that learns to choose optimal action repetition rates based on state contexts. Experiments on Atari 2600 games demonstrate significant performance improvements over static duration baselines, highlighting the efficacy of adaptive temporal abstractions in DRL. This paradigm offers a scalable solution for real-time applications like gaming and robotics, where dynamic action durations are critical.
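As a concrete illustration of the mechanism described in the abstract, below is a minimal sketch of a contextual bandit that selects an action repetition rate for the action a DQN has already chosen. The rate set {1, 2, 4, 8}, the epsilon-greedy linear (ridge-regression) bandit, and the gym-style environment interface are all assumptions made for illustration; the paper's exact bandit algorithm and state features may differ.

```python
import numpy as np

class RepetitionBandit:
    """Epsilon-greedy linear contextual bandit over action repetition rates.

    One ridge-regression value model per candidate rate; the context is a
    feature vector derived from the current state. An illustrative stand-in
    for the paper's bandit module, not its exact algorithm.
    """
    def __init__(self, rates=(1, 2, 4, 8), dim=16, eps=0.1, lam=1.0):
        self.rates = list(rates)
        self.eps = eps
        # Per-arm ridge statistics: A = lam*I + sum(x x^T), b = sum(r x).
        self.A = [lam * np.eye(dim) for _ in self.rates]
        self.b = [np.zeros(dim) for _ in self.rates]

    def select(self, context):
        """Pick an arm (index into self.rates) epsilon-greedily."""
        if np.random.rand() < self.eps:
            return np.random.randint(len(self.rates))
        estimates = [np.linalg.solve(A, b) @ context
                     for A, b in zip(self.A, self.b)]
        return int(np.argmax(estimates))

    def update(self, arm, context, reward):
        """Ridge-regression update for the chosen arm."""
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context


def act_with_repetition(env, dqn_action, bandit, context, gamma=0.99):
    """Repeat the DQN's chosen action for the bandit-selected number of
    frames and credit the bandit with the discounted reward collected."""
    arm = bandit.select(context)
    total, discount, done = 0.0, 1.0, False
    for _ in range(bandit.rates[arm]):
        obs, reward, done, _ = env.step(dqn_action)  # gym-style API assumed
        total += discount * reward
        discount *= gamma
        if done:
            break
    bandit.update(arm, context, total)
    return obs, total, done
```

In practice the context vector could be the DQN's penultimate-layer features, so the bandit learns, per region of the state space, whether rapid re-decision or longer commitment pays off.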
Related papers
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains.
We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains (sketched after this entry).
Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
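The three stabilizers named in the summary above lend themselves to a short sketch: a clipped surrogate loss with a KL penalty toward a frozen reference policy, plus a periodic reference reset. The objective form, coefficients, and the `log_prob` helper are assumptions, not the paper's exact recipe.

```python
import copy
import torch

def clipped_kl_loss(policy, ref_policy, obs, actions, old_logp, adv,
                    clip=0.2, kl_coef=0.01):
    """Clipped policy-gradient surrogate plus a KL penalty to a frozen
    reference policy (illustrative form; `log_prob` is an assumed helper)."""
    logp = policy.log_prob(obs, actions)
    ratio = torch.exp(logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    pg_loss = -torch.min(ratio * adv, clipped * adv).mean()
    with torch.no_grad():
        ref_logp = ref_policy.log_prob(obs, actions)
    kl = (logp - ref_logp).mean()  # simple sample-based KL estimate
    return pg_loss + kl_coef * kl

def maybe_reset_reference(step, policy, ref_policy, every=10_000):
    """Periodically re-anchor the reference to the current policy so the KL
    penalty limits drift without freezing learning permanently."""
    if step > 0 and step % every == 0:
        ref_policy.load_state_dict(copy.deepcopy(policy.state_dict()))
```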
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development.
We present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents.
We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
- When to Sense and Control? A Time-adaptive Approach for Continuous-Time RL [37.58940726230092]
Reinforcement learning (RL) excels at optimizing policies for discrete-time Markov decision processes (MDPs).
We formalize an RL framework, Time-adaptive Control & Sensing (TaCoS), that tackles the continuous-time setting, where the agent must also decide when to sense and act.
We demonstrate that state-of-the-art RL algorithms trained on TaCoS drastically reduce the amount of interaction relative to their discrete-time counterparts (the interface is sketched after this entry).
arXiv Detail & Related papers (2024-06-03T09:57:18Z)
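One way to read the TaCoS idea is that the agent chooses not only what to do but how long to hold it before sensing again. A minimal sketch of such an interface follows; the architecture and duration bounds are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TimeAdaptivePolicy(nn.Module):
    """Policy head emitting both a control and a holding time, so the agent
    decides when it next senses and acts (illustrative TaCoS-style design)."""
    def __init__(self, obs_dim, act_dim, t_min=0.01, t_max=1.0):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.action_head = nn.Linear(64, act_dim)
        self.time_head = nn.Linear(64, 1)
        self.t_min, self.t_max = t_min, t_max

    def forward(self, obs):
        h = self.body(obs)
        action = torch.tanh(self.action_head(h))
        # Squash the duration into [t_min, t_max]; the control is held for
        # this long before the next observation is requested.
        frac = torch.sigmoid(self.time_head(h))
        duration = self.t_min + (self.t_max - self.t_min) * frac
        return action, duration
```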
- RL-CFR: Improving Action Abstraction for Imperfect Information Extensive-Form Games with Reinforcement Learning [42.80561441946148]
We introduce RL-CFR, a novel reinforcement learning (RL) approach for dynamic action abstraction.
RL-CFR builds upon our innovative Markov Decision Process (MDP) formulation, with states corresponding to public information and actions represented as feature vectors indicating specific action abstractions.
In experiments on Heads-up No-limit Texas Hold'em, RL-CFR outperforms ReBeL's replication and Slumbot, demonstrating significant win-rate margins of $64 \pm 11$ and $84 \pm 17$ mbb/hand, respectively.
arXiv Detail & Related papers (2024-03-07T09:12:23Z)
- Reinforcement Learning with Elastic Time Steps [14.838483990647697]
Multi-Objective Soft Elastic Actor-Critic (MOSEAC) is an off-policy actor-critic algorithm that uses elastic time steps to dynamically adjust the control frequency.
We show theoretically that MOSEAC converges and produces stable policies, and validate our findings in a real-time 3D racing game (one way the step duration can enter the reward is sketched after this entry).
arXiv Detail & Related papers (2024-02-22T20:49:04Z)
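A minimal sketch of one way an elastic step duration can enter a multi-objective reward, trading task reward against the cost of fast control. The penalty form and coefficients are assumptions; MOSEAC's exact weighting and its adaptation of these coefficients differ in detail.

```python
def elastic_step_reward(task_reward, dt, alpha=1.0, beta=0.1):
    """Trade task reward against control frequency: shorter steps mean more
    decisions per second, so they are charged a cost proportional to 1/dt.
    Illustrative form only, not MOSEAC's exact multi-objective reward."""
    frequency_cost = beta / dt  # faster control -> more decisions -> higher cost
    return alpha * task_reward - frequency_cost
```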
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme (a simplified stand-in is sketched after this entry).
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
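To make adaptive action discretization concrete, here is a deliberately simple k-means stand-in: fit centroids to the dataset's continuous actions, then treat centroid indices as a discrete action space for the offline RL method. The paper's scheme is learned rather than k-means; this is a simplification for illustration.

```python
import numpy as np

def kmeans_quantize(actions, k=16, iters=50, seed=0):
    """Fit k centroids to a (N, action_dim) float array of continuous actions
    and map every action to its nearest centroid index. A simplified
    stand-in for the paper's learned quantization scheme."""
    rng = np.random.default_rng(seed)
    centroids = actions[rng.choice(len(actions), size=k, replace=False)]
    assignments = np.zeros(len(actions), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(actions[:, None, :] - centroids[None, :, :],
                               axis=-1)
        assignments = dists.argmin(axis=1)
        for j in range(k):
            members = actions[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments

# Downstream, a method such as IQL or CQL acts over the k centroid indices
# and decodes an index back to its centroid action at execution time.
```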
- Learning to Walk Autonomously via Reset-Free Quality-Diversity [73.08073762433376]
Quality-Diversity algorithms can discover large and complex behavioural repertoires consisting of both diverse and high-performing skills.
Existing QD algorithms need large numbers of evaluations as well as episodic resets, which require manual human supervision and interventions.
This paper proposes Reset-Free Quality-Diversity optimization (RF-QD) as a step towards autonomous learning for robotics in open-ended environments.
arXiv Detail & Related papers (2022-04-07T14:07:51Z)
- RAPID-RL: A Reconfigurable Architecture with Preemptive-Exits for Efficient Deep-Reinforcement Learning [7.990007201671364]
We propose a reconfigurable architecture with preemptive exits for efficient deep RL (RAPID-RL).
RAPID-RL enables conditional activation of preemptive layers based on the difficulty level of inputs.
We show that RAPID-RL incurs 0.34x (0.25x) the number of operations (OPS) while maintaining performance above 0.88x (0.91x) of baseline on Atari (drone navigation) tasks (the early-exit mechanism is sketched after this entry).
arXiv Detail & Related papers (2021-09-16T21:30:40Z)
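A sketch of the preemptive-exit idea: a Q-network with a cheap early head that is used whenever every input in the batch clears a confidence gap between its top two Q-values, and a deeper stage otherwise. Judging difficulty by the Q-value gap, and the two-stage layout, are our assumptions about the mechanism.

```python
import torch
import torch.nn as nn

class EarlyExitQNet(nn.Module):
    """Q-network with a preemptive exit. Easy inputs are answered by the
    shallow head; hard ones continue into the deeper stage (illustrative
    realization of conditional, difficulty-based activation)."""
    def __init__(self, obs_dim, n_actions, gap_threshold=1.0):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.exit1 = nn.Linear(64, n_actions)   # cheap preemptive head
        self.stage2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.exit2 = nn.Linear(64, n_actions)   # full-depth head
        self.gap_threshold = gap_threshold

    def forward(self, obs):
        h1 = self.stage1(obs)
        q1 = self.exit1(h1)
        top2 = q1.topk(2, dim=-1).values
        # Exit early only if every sample's top-1/top-2 gap is confident.
        if (top2[..., 0] - top2[..., 1]).min() >= self.gap_threshold:
            return q1
        return self.exit2(self.stage2(h1))
```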
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as rewards.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation (the soft Bellman backup is sketched after this entry).
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
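The soft Q-learning view admits a compact sketch: the bootstrap target replaces the hard max over next-token Q-values with a temperature-scaled log-sum-exp, which keeps the induced generation policy stochastic. This shows the core backup only; the paper's full objective has more moving parts.

```python
import torch

def soft_q_targets(q_next, rewards, done, gamma=0.99, tau=1.0):
    """Soft Bellman backup for text generation: q_next is [batch, vocab]
    (Q-values over next tokens), rewards and done are [batch] floats. The
    soft value tau * logsumexp(Q / tau) replaces the max(Q) of standard
    Q-learning."""
    v_next = tau * torch.logsumexp(q_next / tau, dim=-1)
    return rewards + gamma * (1.0 - done) * v_next
```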
- PlayVirtual: Augmenting Cycle-Consistent Virtual Trajectories for Reinforcement Learning [84.30765628008207]
We propose a novel method, dubbed PlayVirtual, which augments cycle-consistent virtual trajectories to enhance data efficiency for RL feature representation learning (the cycle-consistency signal is sketched after this entry).
Our method outperforms the current state-of-the-art methods by a large margin on both benchmarks.
arXiv Detail & Related papers (2021-06-08T07:37:37Z)
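A sketch of the cycle-consistency signal behind virtual trajectories, assuming learned forward and backward dynamics models over latent states (the model interfaces here are assumptions): roll the state forward through predicted transitions, roll it back through the backward model, and penalize the reconstruction error of the starting state.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(forward_model, backward_model, state, actions):
    """Forward rollout through predicted dynamics, then a backward rollout
    over the same actions; the start-state reconstruction error trains the
    models and the features (simplified single-trajectory form)."""
    z = state
    for a in actions:              # virtual forward rollout
        z = forward_model(z, a)
    for a in reversed(actions):    # backward rollout toward the start
        z = backward_model(z, a)
    return F.mse_loss(z, state)
```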
- Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained proximal policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum, resulting in optimal yet physically feasible robotic control behavior without the need for precise reward-function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.