Reinforcement Learning with History-Dependent Dynamic Contexts
- URL: http://arxiv.org/abs/2302.02061v2
- Date: Thu, 18 May 2023 02:08:52 GMT
- Title: Reinforcement Learning with History-Dependent Dynamic Contexts
- Authors: Guy Tennenholtz, Nadav Merlis, Lior Shani, Martin Mladenov, Craig
Boutilier
- Abstract summary: We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments.
We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions.
Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features.
- Score: 29.8131459650617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel
reinforcement learning framework for history-dependent environments that
generalizes the contextual MDP framework to handle non-Markov environments,
where contexts change over time. We consider special cases of the model, with a
focus on logistic DCMDPs, which break the exponential dependence on history
length by leveraging aggregation functions to determine context transitions.
This special structure allows us to derive an upper-confidence-bound style
algorithm for which we establish regret bounds. Motivated by our theoretical
results, we introduce a practical model-based algorithm for logistic DCMDPs
that plans in a latent space and uses optimism over history-dependent features.
We demonstrate the efficacy of our approach on a recommendation task (using
MovieLens data) where user behavior dynamics evolve in response to
recommendations.
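To make the logistic structure concrete, below is a minimal sketch, assuming discrete contexts, per-step feature vectors, and a simple sum aggregator (these names and shapes are illustrative assumptions, not the paper's notation). It shows how an aggregation function can compress an arbitrarily long history into a fixed-size statistic that drives a softmax (logistic) context transition, which is the mechanism that avoids an exponential dependence on history length.

```python
import numpy as np

# Hedged sketch of a logistic context-transition model for a DCMDP.
# All names and dimensions below are assumptions for illustration only.

def aggregate_history(history_features):
    """Aggregate per-step feature vectors into a fixed-size statistic.
    A sum keeps the statistic the same size no matter how long the
    history is, which is what breaks the exponential blow-up."""
    return np.sum(history_features, axis=0)

def context_transition_probs(agg_features, context_weights):
    """Multinomial logistic (softmax) distribution over next contexts,
    parameterized by one weight vector per context."""
    logits = context_weights @ agg_features   # shape: (num_contexts,)
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example usage with made-up dimensions: 3 contexts, 4-dim step features.
rng = np.random.default_rng(0)
history = rng.normal(size=(10, 4))            # features of 10 past steps
weights = rng.normal(size=(3, 4))             # per-context parameters
print(context_transition_probs(aggregate_history(history), weights))
```

An optimistic (UCB-style) variant, as the abstract suggests, would inflate these transition estimates with a confidence bonus computed from the same history-dependent features; that bonus is not shown here.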
Related papers
- Dynamical-VAE-based Hindsight to Learn the Causal Dynamics of Factored-POMDPs [9.662551514840388]
We introduce a Dynamical Variational Auto-Encoder (DVAE) designed to learn causal Markovian dynamics from offline trajectories.
Our method employs an extended hindsight framework that integrates past, current, and multi-step future information.
Empirical results reveal that this approach uncovers the causal graph governing hidden state transitions more effectively than history-based and typical hindsight-based models.
arXiv Detail & Related papers (2024-11-12T14:27:45Z)
- Hierarchical Reinforcement Learning for Temporal Abstraction of Listwise Recommendation [51.06031200728449]
We propose a novel framework called mccHRL to provide different levels of temporal abstraction on listwise recommendation.
Within the hierarchical framework, the high-level agent studies the evolution of user perception, while the low-level agent produces the item selection policy.
Results show a significant performance improvement by our method compared with several well-known baselines.
arXiv Detail & Related papers (2024-09-11T17:01:06Z)
- Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL [57.202733701029594]
Decision Mamba is a novel multi-grained state space model with a self-evolving policy learning strategy.
To mitigate overfitting on noisy trajectories, a self-evolving policy is proposed using progressive regularization.
The policy evolves by using its own past knowledge to refine the suboptimal actions, thus enhancing its robustness on noisy demonstrations.
arXiv Detail & Related papers (2024-06-08T10:12:00Z)
- Optimization of geological carbon storage operations with multimodal latent dynamic model and deep reinforcement learning [1.8549313085249324]
This study introduces the multimodal latent dynamic (MLD) model, a deep learning framework for fast flow prediction and well control optimization in GCS.
Unlike existing models, the MLD supports diverse input modalities, allowing comprehensive data interactions.
The approach outperforms traditional methods, achieving the highest NPV while reducing computational resources by over 60%.
arXiv Detail & Related papers (2024-06-07T01:30:21Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- On learning history based policies for controlling Markov decision processes [44.17941122294582]
We introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP.
We numerically evaluate its effectiveness on a set of continuous control tasks.
arXiv Detail & Related papers (2022-11-06T02:47:55Z)
- Data Augmentation through Expert-guided Symmetry Detection to Improve Performance in Offline Reinforcement Learning [0.0]
Offline estimation of the dynamical model of a Markov Decision Process (MDP) is a non-trivial task.
Recent works showed that an expert-guided pipeline relying on density estimation methods effectively detects such symmetric structure in deterministic environments.
We show that the former results lead to a performance improvement when solving the learned MDP and then applying the optimized policy in the real environment.
arXiv Detail & Related papers (2021-12-18T14:32:32Z)
- Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures a certain "fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
- Learning Robust State Abstractions for Hidden-Parameter Block MDPs [55.31018404591743]
We leverage ideas of common structure from the HiP-MDP setting to enable robust state abstractions inspired by Block MDPs.
We derive instantiations of this new framework for both multi-task reinforcement learning (MTRL) and meta-reinforcement learning (Meta-RL) settings.
arXiv Detail & Related papers (2020-07-14T17:25:27Z)
- Counterfactual Learning of Stochastic Policies with Continuous Actions: from Models to Offline Evaluation [41.21447375318793]
We introduce a modelling strategy based on a joint kernel embedding of contexts and actions.
We empirically show that the optimization aspect of counterfactual learning is important.
We propose an evaluation protocol for offline policies in real-world logged systems.
arXiv Detail & Related papers (2020-04-22T07:42:30Z)
- A Dependency Syntactic Knowledge Augmented Interactive Architecture for End-to-End Aspect-based Sentiment Analysis [73.74885246830611]
We propose a novel dependency syntactic knowledge augmented interactive architecture with multi-task learning for end-to-end ABSA.
This model is capable of fully exploiting the syntactic knowledge (dependency relations and types) by leveraging a well-designed Dependency Relation Embedded Graph Convolutional Network (DreGcn).
Extensive experimental results on three benchmark datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-04T14:59:32Z)