Probabilistic Planning with Partially Ordered Preferences over Temporal Goals
- URL: http://arxiv.org/abs/2209.12267v1
- Date: Sun, 25 Sep 2022 17:13:24 GMT
- Title: Probabilistic Planning with Partially Ordered Preferences over Temporal Goals
- Authors: Hazhar Rahmani, Abhishek N. Kulkarni, and Jie Fu
- Abstract summary: We study planning in Markov decision processes (MDPs) with preferences over temporally extended goals.
We introduce a variant of the deterministic finite automaton, referred to as a preference DFA, for specifying the user's preferences over temporally extended goals.
We prove that a weak-stochastic nondominated policy given the preference specification is Pareto-optimal in the constructed multi-objective MDP.
- Score: 22.77805882908817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study planning in stochastic systems, modeled as Markov
decision processes (MDPs), with preferences over temporally extended goals.
Prior work on temporal planning with preferences assumes that the user
preferences form a total order, meaning that every pair of outcomes is
comparable. In this work, we consider the case where the preferences over
possible outcomes form a partial order rather than a total order. We first
introduce a variant of the deterministic finite automaton, referred to as a
preference DFA, for specifying the user's preferences over temporally extended
goals. Drawing on order theory, we translate the preference DFA into a
preference relation over policies for probabilistic planning in a labeled MDP.
In this treatment, a most preferred policy induces a weak-stochastic
nondominated probability distribution over the finite paths in the MDP. The
proposed planning algorithm hinges on the construction of a multi-objective
MDP. We prove that a weak-stochastic nondominated policy given the preference
specification is Pareto-optimal in the constructed multi-objective MDP, and
vice versa. Throughout the paper, we employ a running example to demonstrate
the proposed preference specification and solution approaches. We show the
efficacy of our algorithm using the example with detailed analysis, and then
discuss possible future directions.
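To make the construction above concrete, the following is a minimal Python sketch of one way a preference DFA, a labeled MDP, and their product could be represented. The class names, fields, and the choice of one reachability objective per outcome class are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the authors' implementation): a preference DFA over
# temporally extended goals, a labeled MDP, and the product construction that
# yields one reachability objective per outcome class of the preference DFA.
from dataclasses import dataclass
from itertools import product as cartesian

@dataclass
class PreferenceDFA:
    states: set        # Q
    alphabet: set      # 2^AP; each symbol is a frozenset of atomic propositions
    init: object       # q0
    delta: dict        # (q, symbol) -> q'
    outcome_of: dict   # q -> outcome class (groups DFA states into outcomes)
    preferred: set     # pairs (o1, o2): outcome o1 strictly preferred to o2 (partial order)

@dataclass
class LabeledMDP:
    states: set
    actions: set
    trans: dict        # (s, a) -> {s': probability}
    init: object
    label: dict        # s -> frozenset of atomic propositions

def product_mdp(mdp, pdfa):
    """Product MDP over states (s, q).  Each outcome class o induces a
    reachability objective: the set of product states whose DFA component
    belongs to o.  Pareto-optimal policies of this multi-objective MDP are
    intended to correspond to weak-stochastic nondominated policies."""
    prod_states = set(cartesian(mdp.states, pdfa.states))
    prod_trans = {}
    for (s, q) in prod_states:
        for a in mdp.actions:
            if (s, a) not in mdp.trans:
                continue
            succ = {}
            for s2, p in mdp.trans[(s, a)].items():
                q2 = pdfa.delta[(q, mdp.label[s2])]  # DFA reads the label of the successor state
                succ[(s2, q2)] = succ.get((s2, q2), 0.0) + p
            prod_trans[((s, q), a)] = succ
    objectives = {o: {(s, q) for (s, q) in prod_states if pdfa.outcome_of.get(q) == o}
                  for o in set(pdfa.outcome_of.values())}
    return prod_states, prod_trans, objectives
```

Under these assumptions, maximizing the vector of satisfaction probabilities (one component per outcome class) and keeping only the Pareto-optimal points mirrors the reduction described in the abstract: weak-stochastic nondominated policies of the original problem correspond to Pareto-optimal policies of the multi-objective product MDP.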
Related papers
- VPO: Leveraging the Number of Votes in Preference Optimization [5.200545764106177]
We introduce a technique that leverages user voting data to better align with diverse subjective preferences.
We develop the Vote-based Preference Optimization framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs.
arXiv Detail & Related papers (2024-10-30T10:39:34Z)
- An incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting [53.36437745983783]
We first construct a max-margin optimization-based model to represent potentially non-monotonic preferences.
We devise information amount measurement methods and question selection strategies to pinpoint the most informative alternative in each iteration.
Two incremental preference elicitation-based algorithms are developed to learn potentially non-monotonic preferences.
arXiv Detail & Related papers (2024-09-04T14:36:20Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss (a hedged sketch of such a combined objective follows this entry).
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
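The entry above notes that the proposed algorithm combines a preference optimization loss with a supervised learning loss. The following is a hedged sketch of one common way such a combination can look: a DPO-style preference term plus an SFT (maximum-likelihood) term on the preferred response. The coefficient names (`beta`, `sft_coeff`) and the function signature are illustrative, not the paper's exact formulation.

```python
# Hedged sketch: a DPO-style preference loss regularized by an SFT (MLE) term
# on the preferred response.  Coefficients and names are illustrative.
import torch
import torch.nn.functional as F

def combined_loss(logp_chosen, logp_rejected,          # summed log-probs under the current policy
                  ref_logp_chosen, ref_logp_rejected,  # same quantities under a frozen reference policy
                  beta=0.1, sft_coeff=1.0):
    # Preference (DPO-style) term: prefer the chosen response over the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(margin).mean()
    # Supervised term: maximize likelihood of the chosen response.
    sft_loss = -logp_chosen.mean()
    return pref_loss + sft_coeff * sft_loss
```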
- Belief-State Query Policies for Planning With Preferences Under Partial Observability [18.821166966365315]
Planning in real-world settings often entails addressing partial observability while aligning with users' preferences.
We present a novel framework for expressing users' preferences about agent behavior in a partially observable setting using parameterized belief-state query (BSQ) preferences.
We show that BSQ preferences provide a computationally feasible approach for planning with preferences in partially observable settings.
arXiv Detail & Related papers (2024-05-24T20:04:51Z)
- Preference-Based Planning in Stochastic Environments: From Partially-Ordered Temporal Goals to Most Preferred Policies [25.731912021122287]
We consider systems modeled as Markov decision processes, given a partially ordered preference over a set of temporally extended goals.
To plan with the partially ordered preference, we introduce order theory to map a preference over temporal goals to a preference over policies for the MDP.
A most preferred policy under an ordering induces a nondominated probability distribution over the finite paths in the MDP.
arXiv Detail & Related papers (2024-03-27T02:46:09Z)
- Probabilistic Planning with Prioritized Preferences over Temporal Logic Objectives [26.180359884973566]
We study temporal planning in probabilistic environments, modeled as labeled Markov decision processes (MDPs).
This paper introduces a new specification language, termed prioritized qualitative choice linear temporal logic on finite traces.
We formulate and solve a problem of computing an optimal policy that minimizes the expected score of dissatisfaction given user preferences.
arXiv Detail & Related papers (2023-04-23T13:03:27Z)
- Probabilistic Permutation Graph Search: Black-Box Optimization for Fairness in Ranking [53.94413894017409]
We present a novel way of representing permutation distributions, based on the notion of permutation graphs.
Similar to PL, our distribution representation, called PPG, can be used for black-box optimization of fairness.
arXiv Detail & Related papers (2022-04-28T20:38:34Z)
- You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We show that the proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance as well (a minimal sketch of the clipped surrogate in question follows this entry).
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
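For reference, the ratio-clipping surrogate that the entry above calls into question is commonly written as below. This is a generic minimal sketch of the standard PPO clipped objective, not code from the cited paper; tensor shapes and names are illustrative.

```python
# Minimal sketch of PPO's clipped surrogate objective:
# L = -E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], with r = pi_new(a|s) / pi_old(a|s).
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)  # probability ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```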
- Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
- Probabilistic Planning with Preferences over Temporal Goals [21.35365462532568]
We present a formal language for specifying qualitative preferences over temporal goals and a preference-based planning method in stochastic systems.
Using automata-theoretic modeling, the proposed specification allows us to express preferences over different sets of outcomes, where each outcome describes a set of temporal sequences of subgoals.
We define the value of preference satisfaction given a process over possible outcomes and develop an algorithm for time-constrained probabilistic planning in labeled Markov decision processes.
arXiv Detail & Related papers (2021-03-26T14:26:40Z)
- Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the limit-deterministic generalized Büchi automaton (LDGBA) and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.