Anticipating Oblivious Opponents in Stochastic Games
- URL: http://arxiv.org/abs/2409.11671v1
- Date: Wed, 18 Sep 2024 03:17:40 GMT
- Title: Anticipating Oblivious Opponents in Stochastic Games
- Authors: Shadi Tasdighi Kalat, Sriram Sankaranarayanan, Ashutosh Trivedi
- Abstract summary: We introduce a notion of consistency that guarantees that the belief states tracked by our automaton stay within a fixed distance of the precise belief state obtained by knowledge of the full history.
We show how the information state machine yields an MDP that serves as the starting point for computing optimal policies for maximizing a reward function defined over plays.
- Score: 7.9266383017424795
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an approach for systematically anticipating the actions and policies employed by \emph{oblivious} environments in concurrent stochastic games, while maximizing a reward function. Our main contribution lies in the synthesis of a finite \emph{information state machine} whose alphabet ranges over the actions of the environment. Each state of the automaton is mapped to a belief state about the policy used by the environment. We introduce a notion of consistency that guarantees that the belief states tracked by our automaton stay within a fixed distance of the precise belief state obtained by knowledge of the full history. We provide methods for checking consistency of an automaton and a synthesis approach which, upon successful termination, yields such a machine. We show how the information state machine yields an MDP that serves as the starting point for computing optimal policies for maximizing a reward function defined over plays. We present an experimental evaluation over benchmark examples, including human activity data for tasks such as cataract surgery and furniture assembly, wherein our approach successfully anticipates the policies and actions of the environment in order to maximize the reward.
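To make the construction concrete, here is a minimal sketch, assuming a toy setting with two hypothetical candidate environment policies: an exact Bayesian belief about which policy the oblivious environment is playing is updated from observed environment actions, and beliefs are quantized onto a finite grid, standing in for the finitely many states of an information state machine. The candidate policies, the grid resolution, and all names are illustrative assumptions, not the paper's synthesis procedure.

```python
# A minimal sketch (not the paper's algorithm): exact Bayesian belief
# tracking over two hypothetical candidate environment policies, with
# beliefs quantized onto a finite grid so that finitely many
# "information states" suffice. All numbers are illustrative.
import numpy as np

# Row i: assumed action distribution of candidate environment policy i.
CANDIDATE_POLICIES = np.array([
    [0.7, 0.2, 0.1],   # policy A favors action 0
    [0.1, 0.2, 0.7],   # policy B favors action 2
])

def bayes_update(belief, action):
    """Exact belief update after observing one environment action."""
    posterior = belief * CANDIDATE_POLICIES[:, action]
    return posterior / posterior.sum()

EPS = 0.1  # grid resolution; a stand-in for the paper's fixed distance bound

def quantize(belief):
    """Round the belief to a grid point, giving finitely many states."""
    grid = np.round(belief / EPS) * EPS
    return tuple(grid / grid.sum())  # renormalize onto the simplex

belief = np.array([0.5, 0.5])       # uniform prior over the candidates
for action in [2, 2, 0, 2, 2]:      # a sample history of environment actions
    belief = bayes_update(belief, action)
    print(f"exact belief {belief.round(3)} -> information state {quantize(belief)}")
```

In this toy version, consistency amounts to the quantized belief staying close to the exact one along every history; the paper instead checks consistency of a synthesized automaton directly, and the resulting machine induces the MDP over which optimal policies are computed.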
Related papers
- Evaluating Robot Policies in a World Model [54.874926065292904]
We investigate World-model-based Policy Evaluation (WPE).
WPE achieves high fidelity in mimicking robot arm movements as in real videos.
We show that WPE can serve as a starting point for evaluating robot policies before real-world deployment.
arXiv Detail & Related papers (2025-05-31T15:51:56Z)
- Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach [83.21177515180564]
We propose a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment.
Our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate.
arXiv Detail & Related papers (2025-05-22T09:08:47Z)
- Learning Reward Machines from Partially Observed Policies [0.40964539027092917]
Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or demonstrations by an expert.
Our goal is to identify the true reward machine using finite information.
arXiv Detail & Related papers (2025-02-06T03:48:25Z)
- Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures [1.75493501156941]
We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy.
For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions visited during the next time steps.
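As a rough illustration of such an intrinsic reward, the sketch below estimates the discounted distribution of future state-action pairs from a single sampled rollout and uses its Shannon entropy as a bonus. The estimator and all names are assumptions for illustration, not the paper's off-policy method.

```python
# A hedged sketch: Monte-Carlo estimate of the discounted distribution of
# future state-action pairs from one rollout; its entropy is the bonus.
from collections import Counter
import math

GAMMA = 0.9

def discounted_visitation(rollout, gamma=GAMMA):
    """rollout: (state, action) pairs visited during the next time steps."""
    weights = Counter()
    for t, sa in enumerate(rollout):
        weights[sa] += (1 - gamma) * gamma ** t
    total = sum(weights.values())
    return {sa: w / total for sa, w in weights.items()}

def entropy_bonus(rollout):
    dist = discounted_visitation(rollout)
    return -sum(p * math.log(p) for p in dist.values())

# A rollout that keeps revisiting the same two pairs yields a low bonus.
print(entropy_bonus([(0, 1), (1, 0), (0, 1), (1, 0)]))
```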
arXiv Detail & Related papers (2024-12-09T16:56:06Z)
- Non-maximizing policies that fulfill multi-criterion aspirations in expectation [0.7874708385247353]
In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent is usually determined by expressing the goal as a scalar reward function.
We consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants maximized.
Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible.
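A toy sketch of the non-maximizing idea, under made-up numbers: with two evaluation metrics and two actions, a probabilistic mixture of actions is chosen so that the expected outcome equals the aspiration exactly, instead of maximizing either metric. The feasibility set here is just the segment between the two action outcomes, the simplest instance of the simplices mentioned above.

```python
# Toy aspiration-based action selection (illustrative, not the paper's
# algorithm): mix two actions so the expected outcome hits the aspiration.
import numpy as np

outcomes = np.array([[1.0, 0.0],    # action 0: good on metric 1 only
                     [0.0, 1.0]])   # action 1: good on metric 2 only
aspiration = np.array([0.3, 0.7])   # desired expected value per metric

# Solve p * outcomes[0] + (1 - p) * outcomes[1] = aspiration for p.
p = aspiration[0] / (outcomes[0, 0] - outcomes[1, 0])
mix = np.array([p, 1 - p])
print("action distribution:", mix)          # [0.3 0.7]
print("expected outcome:", mix @ outcomes)  # equals the aspiration
```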
arXiv Detail & Related papers (2024-08-08T11:41:04Z)
- Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies [25.760946763103483]
We propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks.
Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states which are then translated to actions using rigid action estimation.
arXiv Detail & Related papers (2024-06-17T17:00:41Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Covert Planning against Imperfect Observers [29.610121527096286]
Covert planning refers to a class of constrained planning problems where an agent aims to accomplish a task with minimal information leaked to a passive observer to avoid detection.
This paper studies how covert planning can leverage the coupling of dynamics and the observer's imperfect observation to achieve optimal performance without being detected.
arXiv Detail & Related papers (2023-10-25T17:23:57Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
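The generative view can be sketched with plain GP regression: treat the value function as a Gaussian process and condition it on observed state and reward-to-go pairs. This simplification omits the online OS-GPTD recursions and the ensemble weighting; the data and kernel parameters below are made up.

```python
# A minimal GP value-function sketch (single fixed RBF kernel, batch form).
import numpy as np

def rbf(X, Y, lengthscale=1.0):
    """RBF kernel matrix between 1-D state arrays X and Y."""
    return np.exp(-0.5 * (X[:, None] - Y[None, :]) ** 2 / lengthscale**2)

states = np.array([0.0, 1.0, 2.0, 3.0])
targets = np.array([1.0, 0.8, 0.5, 0.1])  # empirical reward-to-go values
noise = 1e-2

K = rbf(states, states) + noise * np.eye(len(states))
alpha = np.linalg.solve(K, targets)

query = np.array([1.5])
print(rbf(query, states) @ alpha)          # GP posterior mean value at 1.5
```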
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- Nonprehensile Riemannian Motion Predictive Control [57.295751294224765]
We introduce a novel Real-to-Sim reward analysis technique to reliably imagine and predict the outcome of taking possible actions for a real robotic platform.
We produce a closed-loop controller to reactively push objects in a continuous action space.
We observe that RMPC is robust in cluttered as well as occluded environments and outperforms the baselines.
arXiv Detail & Related papers (2021-11-15T18:50:04Z)
- APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successor features (Hansen et al.) with nonparametric entropy maximization, the intractable mutual information can be efficiently optimized.
The proposed method, Active Pretraining with Successor Features (APS), explores the environment via nonparametric entropy maximization, and the explored data can be efficiently leveraged to learn behavior.
arXiv Detail & Related papers (2021-08-31T16:30:35Z)
- Learning One Representation to Optimize All Rewards [19.636676744015197]
We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process.
It provides explicit near-optimal policies for any reward specified a posteriori.
This is a step towards learning controllable agents in arbitrary black-box environments.
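A hedged sketch of how an FB representation is used once learned: the reward, specified a posteriori, is embedded into a vector z, and acting greedily on F(s, a)^T z yields the induced policy. F and B below are random placeholders; learning them so that F(s, a)^T B(s') approximates the successor measure is the paper's actual contribution.

```python
# Using a (placeholder) forward-backward factorization to read off a policy
# for a reward revealed only after training. F has shape (S, A, d), B (S, d).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 3, 4

F = rng.normal(size=(n_states, n_actions, d))   # forward embedding F(s, a)
B = rng.normal(size=(n_states, d))              # backward embedding B(s')

reward = np.zeros(n_states)
reward[4] = 1.0                                  # reward specified a posteriori

z = B.T @ reward / n_states                      # z = E_s[B(s) r(s)], uniform weights
q_values = F @ z                                 # Q_z(s, a) = F(s, a)^T z
print(q_values.argmax(axis=1))                   # greedy policy per state
```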
arXiv Detail & Related papers (2021-03-14T15:00:08Z)
- Instance-Aware Predictive Navigation in Multi-Agent Environments [93.15055834395304]
We propose an Instance-Aware Predictive Control (IPC) approach, which forecasts interactions between agents as well as future scene structures.
We adopt a novel multi-instance event prediction module to estimate the possible interaction among agents in the ego-centric view.
We design a sequential action sampling strategy to better leverage predicted states on both scene-level and instance-level.
arXiv Detail & Related papers (2021-01-14T22:21:25Z)
- Learning Non-Markovian Reward Models in MDPs [0.0]
We show how to formalise the non-Markovian reward function using a Mealy machine.
In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves.
While the MDP is known to the agent, the reward function is unknown and must be learnt.
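To see what a Mealy-machine reward model looks like, here is a tiny hand-written example (illustrative, not learned and not from the paper): the reward emitted for an input symbol depends on the machine's internal state, so identical observations can yield different rewards depending on the history.

```python
# A two-state Mealy machine: reward 1 is paid for symbol "b" only when it
# directly follows an "a", which no Markovian per-state reward can express.
MEALY = {
    # (state, input) -> (next_state, reward)
    ("q0", "a"): ("q1", 0.0),
    ("q0", "b"): ("q0", 0.0),
    ("q1", "a"): ("q1", 0.0),
    ("q1", "b"): ("q0", 1.0),
}

def total_reward(word, state="q0"):
    total = 0.0
    for symbol in word:
        state, reward = MEALY[(state, symbol)]
        total += reward
    return total

print(total_reward("abab"))  # 2.0: the pattern "ab" pays off twice
```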
arXiv Detail & Related papers (2020-01-25T10:51:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.