Probabilistic Planning with Partially Ordered Preferences over Temporal Goals
- URL: http://arxiv.org/abs/2209.12267v1
- Date: Sun, 25 Sep 2022 17:13:24 GMT
- Title: Probabilistic Planning with Partially Ordered Preferences over Temporal Goals
- Authors: Hazhar Rahmani, Abhishek N. Kulkarni, and Jie Fu
- Abstract summary: We study planning in Markov decision processes (MDPs) with preferences over temporally extended goals.
We introduce a variant of the deterministic finite automaton, referred to as a preference DFA, for specifying the user's preferences over temporally extended goals.
We prove that a weak-stochastic nondominated policy given the preference specification is Pareto-optimal in the constructed multi-objective MDP.
- Score: 22.77805882908817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study planning in stochastic systems, modeled as Markov
decision processes (MDPs), with preferences over temporally extended goals.
Prior work on temporal planning with preferences assumes that the user
preferences form a total order, meaning that every pair of outcomes is
comparable. In this work, we consider the case where the preferences over
possible outcomes form a partial order rather than a total order. We first
introduce a variant of the deterministic finite automaton, referred to as a
preference DFA, for specifying the user's preferences over temporally extended
goals. Drawing on order theory, we translate the preference DFA into a
preference relation over policies for probabilistic planning in a labeled MDP.
In this treatment, a most preferred policy induces a weak-stochastic
nondominated probability distribution over the finite paths in the MDP. The
proposed planning algorithm hinges on the construction of a multi-objective
MDP. We prove that a weak-stochastic nondominated policy given the preference
specification is Pareto-optimal in the constructed multi-objective MDP, and
vice versa. Throughout the paper, we employ a running example to demonstrate
the proposed preference specification and solution approaches. We show the
efficacy of our algorithm using the example with detailed analysis, and then
discuss possible future directions.
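To make the construction above concrete, the following is a minimal Python sketch of one way a preference DFA, a labeled MDP, and their product could be represented. The class names, fields, and the choice of one reachability objective per outcome class are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the authors' implementation): a preference DFA over
# temporally extended goals, a labeled MDP, and the product construction that
# yields one reachability objective per outcome class of the preference DFA.
from dataclasses import dataclass
from itertools import product as cartesian

@dataclass
class PreferenceDFA:
    states: set        # Q
    alphabet: set      # 2^AP; each symbol is a frozenset of atomic propositions
    init: object       # q0
    delta: dict        # (q, symbol) -> q'
    outcome_of: dict   # q -> outcome class (groups DFA states into outcomes)
    preferred: set     # pairs (o1, o2): outcome o1 strictly preferred to o2 (partial order)

@dataclass
class LabeledMDP:
    states: set
    actions: set
    trans: dict        # (s, a) -> {s': probability}
    init: object
    label: dict        # s -> frozenset of atomic propositions

def product_mdp(mdp, pdfa):
    """Product MDP over states (s, q).  Each outcome class o induces a
    reachability objective: the set of product states whose DFA component
    belongs to o.  Pareto-optimal policies of this multi-objective MDP are
    intended to correspond to weak-stochastic nondominated policies."""
    prod_states = set(cartesian(mdp.states, pdfa.states))
    prod_trans = {}
    for (s, q) in prod_states:
        for a in mdp.actions:
            if (s, a) not in mdp.trans:
                continue
            succ = {}
            for s2, p in mdp.trans[(s, a)].items():
                q2 = pdfa.delta[(q, mdp.label[s2])]  # DFA reads the label of the successor state
                succ[(s2, q2)] = succ.get((s2, q2), 0.0) + p
            prod_trans[((s, q), a)] = succ
    objectives = {o: {(s, q) for (s, q) in prod_states if pdfa.outcome_of.get(q) == o}
                  for o in set(pdfa.outcome_of.values())}
    return prod_states, prod_trans, objectives
```

Under these assumptions, maximizing the vector of satisfaction probabilities (one component per outcome class) and keeping only the Pareto-optimal points mirrors the reduction described in the abstract: weak-stochastic nondominated policies of the original problem correspond to Pareto-optimal policies of the multi-objective product MDP.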
Related papers
- VPO: Leveraging the Number of Votes in Preference Optimization [5.200545764106177]
We introduce a technique that leverages user voting data to better align with diverse subjective preferences.
We develop the Vote-based Preference Optimization framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs.
arXiv Detail & Related papers (2024-10-30T10:39:34Z)
- An incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting [53.36437745983783]
We first construct a max-margin optimization-based model to represent potentially non-monotonic preferences.
We devise information amount measurement methods and question selection strategies to pinpoint the most informative alternative in each iteration.
Two incremental preference elicitation-based algorithms are developed to learn potentially non-monotonic preferences.
arXiv Detail & Related papers (2024-09-04T14:36:20Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss (a hedged sketch of such a combined objective follows this entry).
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
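The entry above notes that the proposed algorithm combines a preference optimization loss with a supervised learning loss. The following is a hedged sketch of one common way such a combination can look: a DPO-style preference term plus an SFT (maximum-likelihood) term on the preferred response. The coefficient names (`beta`, `sft_coeff`) and the function signature are illustrative, not the paper's exact formulation.

```python
# Hedged sketch: a DPO-style preference loss regularized by an SFT (MLE) term
# on the preferred response.  Coefficients and names are illustrative.
import torch
import torch.nn.functional as F

def combined_loss(logp_chosen, logp_rejected,          # summed log-probs under the current policy
                  ref_logp_chosen, ref_logp_rejected,  # same quantities under a frozen reference policy
                  beta=0.1, sft_coeff=1.0):
    # Preference (DPO-style) term: prefer the chosen response over the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(margin).mean()
    # Supervised term: maximize likelihood of the chosen response.
    sft_loss = -logp_chosen.mean()
    return pref_loss + sft_coeff * sft_loss
```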
- Belief-State Query Policies for Planning With Preferences Under Partial Observability [18.821166966365315]
Planning in real-world settings often entails addressing partial observability while aligning with users' preferences.
We present a novel framework for expressing users' preferences about agent behavior in a partially observable setting using parameterized belief-state query (BSQ) preferences.
We show that BSQ preferences provide a computationally feasible approach for planning with preferences in partially observable settings.
arXiv Detail & Related papers (2024-05-24T20:04:51Z)
- Preference-Based Planning in Stochastic Environments: From Partially-Ordered Temporal Goals to Most Preferred Policies [25.731912021122287]
We consider systems modeled as Markov decision processes, given a partially ordered preference over a set of temporally extended goals.
To plan with the partially ordered preference, we introduce order theory to map a preference over temporal goals to a preference over policies for the MDP.
A most preferred policy under an ordering induces a nondominated probability distribution over the finite paths in the MDP.
arXiv Detail & Related papers (2024-03-27T02:46:09Z)
- Probabilistic Planning with Prioritized Preferences over Temporal Logic Objectives [26.180359884973566]
We study temporal planning in probabilistic environments, modeled as labeled Markov decision processes (MDPs).
This paper introduces a new specification language, termed prioritized qualitative choice linear temporal logic on finite traces.
We formulate and solve a problem of computing an optimal policy that minimizes the expected score of dissatisfaction given user preferences.
arXiv Detail & Related papers (2023-04-23T13:03:27Z)
- Probabilistic Permutation Graph Search: Black-Box Optimization for Fairness in Ranking [53.94413894017409]
We present a novel way of representing permutation distributions, based on the notion of permutation graphs.
Similar to PL, our distribution representation, called PPG, can be used for black-box optimization of fairness.
arXiv Detail & Related papers (2022-04-28T20:38:34Z)
- You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We show that the proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance as well (a minimal sketch of the clipped surrogate in question follows this entry).
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
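For reference, the ratio-clipping surrogate that the entry above calls into question is commonly written as below. This is a generic minimal sketch of the standard PPO clipped objective, not code from the cited paper; tensor shapes and names are illustrative.

```python
# Minimal sketch of PPO's clipped surrogate objective:
# L = -E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], with r = pi_new(a|s) / pi_old(a|s).
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)  # probability ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```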
- Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
- Probabilistic Planning with Preferences over Temporal Goals [21.35365462532568]
We present a formal language for specifying qualitative preferences over temporal goals and a preference-based planning method in stochastic systems.
Using automata-theoretic modeling, the proposed specification allows us to express preferences over different sets of outcomes, where each outcome describes a set of temporal sequences of subgoals.
We define the value of preference satisfaction given a process over possible outcomes and develop an algorithm for time-constrained probabilistic planning in labeled Markov decision processes.
arXiv Detail & Related papers (2021-03-26T14:26:40Z)
- Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the limit-deterministic generalized Büchi automaton (LDGBA) and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.