Learning Reward Machines through Preference Queries over Sequences
- URL: http://arxiv.org/abs/2308.09301v1
- Date: Fri, 18 Aug 2023 04:49:45 GMT
- Title: Learning Reward Machines through Preference Queries over Sequences
- Authors: Eric Hsiung, Joydeep Biswas, Swarat Chaudhuri
- Abstract summary: We contribute REMAP, a novel algorithm for learning reward machines from preferences.
In addition to the proofs of correctness and termination for REMAP, we present empirical evidence measuring correctness.
- Score: 19.478224060277775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward machines have shown great promise at capturing non-Markovian reward
functions for learning tasks that involve complex action sequencing. However,
no algorithm currently exists for learning reward machines with realistic weak
feedback in the form of preferences. We contribute REMAP, a novel algorithm for
learning reward machines from preferences, with correctness and termination
guarantees. REMAP introduces preference queries in place of membership queries
in the L* algorithm, and leverages a symbolic observation table along with
unification and constraint solving to narrow the hypothesis reward machine
search space. In addition to the proofs of correctness and termination for
REMAP, we present empirical evidence measuring correctness: how frequently
the learned reward machine is isomorphic to the ground truth under a
consistent yet inexact teacher, and the regret between the ground truth and
learned reward machines.
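As a rough illustration of the mechanism described in the abstract, here is a minimal, hypothetical Python sketch of an L*-style symbolic observation table driven by preference queries. Everything below (the `PreferenceTeacher` interface, the variable-naming scheme, the pairwise querying strategy) is an assumption for exposition, not the authors' implementation; REMAP additionally feeds such constraints to a constraint solver and uses unification to narrow the hypothesis reward machine space.

```python
# A minimal sketch (not the authors' implementation) of the core REMAP idea:
# an L*-style observation table whose cells hold symbolic reward variables,
# constrained by answers to preference queries over label sequences.
from itertools import combinations

class PreferenceTeacher:
    """Hypothetical teacher: compares the ground-truth return of two sequences."""
    def __init__(self, true_return):
        self.true_return = true_return  # maps a label sequence to a return

    def prefer(self, seq_a, seq_b):
        ra, rb = self.true_return(seq_a), self.true_return(seq_b)
        return "a" if ra > rb else "b" if rb > ra else "tie"

def build_symbolic_table(prefixes, suffixes):
    """Each cell holds a symbolic variable instead of a concrete reward value."""
    return {(p, s): f"v_{'.'.join(p + s) or 'eps'}" for p in prefixes for s in suffixes}

def collect_constraints(table, teacher):
    """Turn preference answers into ordering constraints over the symbolic
    variables. A constraint solver would then unify variables ('tie' answers)
    and prune hypothesis reward machines inconsistent with the orderings."""
    constraints = []
    for ((pa, sa), va), ((pb, sb), vb) in combinations(list(table.items()), 2):
        answer = teacher.prefer(pa + sa, pb + sb)
        if answer == "a":
            constraints.append((va, ">", vb))
        elif answer == "b":
            constraints.append((vb, ">", va))
        else:
            constraints.append((va, "==", vb))  # unification candidate
    return constraints

# Toy usage: sequences over labels {"a", "b"}; longer runs of "a" pay more.
teacher = PreferenceTeacher(lambda seq: sum(1 for x in seq if x == "a"))
table = build_symbolic_table(prefixes=[(), ("a",)], suffixes=[(), ("b",)])
print(collect_constraints(table, teacher))
```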
Related papers
- Detecting Hidden Triggers: Mapping Non-Markov Reward Functions to Markov [2.486161976966064]
We propose a framework for mapping non-Markov reward functions into equivalent Markov ones by learning a Reward Machine.
Unlike the general practice of learning Reward Machines, we do not require a set of high-level propositional symbols from which to learn.
We empirically validate our approach by learning black-box non-Markov Reward functions in the Officeworld Domain.
arXiv Detail & Related papers (2024-01-20T21:09:27Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
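A simplified sketch below illustrates the general recipe behind such pseudometrics on a toy finite MDP: canonicalise away potential-based reward shaping, normalise scale, then compare. Rewards are treated as vectors over (state, action, next state); the least-squares canonicalisation and the specific norm used here are illustrative assumptions and differ from the paper's exact STARC definitions.

```python
# A simplified, STARC-style pseudometric sketch (not the paper's exact metric).
import numpy as np

def canonicalize(r, gamma, n_states, n_actions):
    """Project out potential-based shaping, gamma*phi(s') - phi(s), by least
    squares, leaving a shaping-invariant representative of the reward."""
    r = r.reshape(n_states, n_actions, n_states)
    cols = []
    for k in range(n_states):
        shaping = np.zeros_like(r)
        shaping[k, :, :] -= 1.0    # -phi(s) term
        shaping[:, :, k] += gamma  # +gamma*phi(s') term
        cols.append(shaping.ravel())
    A = np.stack(cols, axis=1)
    phi, *_ = np.linalg.lstsq(A, r.ravel(), rcond=None)
    return r.ravel() - A @ phi     # residual orthogonal to all shaping terms

def starc_like_distance(r1, r2, gamma=0.9, n_states=3, n_actions=2):
    c1, c2 = (canonicalize(r, gamma, n_states, n_actions) for r in (r1, r2))
    n1, n2 = np.linalg.norm(c1), np.linalg.norm(c2)
    if n1 == 0 or n2 == 0:         # trivial rewards compare as equal iff both trivial
        return 0.0 if n1 == n2 else 1.0
    return float(np.linalg.norm(c1 / n1 - c2 / n2))

rng = np.random.default_rng(0)
r = rng.normal(size=3 * 2 * 3)
phi = rng.normal(size=3)
# A shaped variant of r: its distance to r should be ~0, since potential-based
# shaping is exactly what the canonicalisation removes.
shaped = (r.reshape(3, 2, 3) + 0.9 * phi[None, None, :] - phi[:, None, None]).ravel()
print(starc_like_distance(r, shaped))            # ~0
print(starc_like_distance(r, rng.normal(size=18)))  # > 0 in general
```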
arXiv Detail & Related papers (2023-09-26T20:31:19Z) - Provably Efficient Representation Learning with Tractable Planning in Low-Rank POMDP [81.00800920928621]
We study representation learning in partially observable Markov decision processes (POMDPs).
We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU).
We then show how to adapt this algorithm to also work in the broader class of $\gamma$-observable POMDPs.
arXiv Detail & Related papers (2023-06-21T16:04:03Z) - Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of 'reward collapse', an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z) - ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z) - Delayed Rewards Calibration via Reward Empirical Sufficiency [11.089718301262433]
We introduce a delayed reward calibration paradigm inspired by a classification perspective.
We define an empirical sufficient distribution, where the state vectors within the distribution will lead agents to reward signals.
A purify-trained classifier is designed to obtain the distribution and generate the calibrated rewards.
arXiv Detail & Related papers (2021-02-21T06:42:31Z) - Logically Consistent Loss for Visual Question Answering [66.83963844316561]
The current advancement in neural-network-based Visual Question Answering (VQA) cannot ensure logical consistency due to the independent and identically distributed (i.i.d.) assumption.
We propose a new model-agnostic logic constraint to tackle this issue by formulating a logically consistent loss in the multi-task learning framework.
Experiments confirm that the proposed loss formulae and the introduction of hybrid batches lead to more consistency as well as better performance.
arXiv Detail & Related papers (2020-11-19T20:31:05Z) - Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how exposing the reward function's code to the RL agent lets it exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
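Concretely, a reward machine can be represented as a Mealy-style finite state machine whose transitions fire on the set of propositions made true at each environment step and emit a reward. Below is a minimal sketch; the class layout and the coffee-then-office example are illustrative, not taken from the paper.

```python
# A minimal reward machine sketch: a finite state machine whose transitions are
# triggered by truth assignments over propositions and emit rewards.
class RewardMachine:
    def __init__(self, states, initial, delta_u, delta_r):
        self.states = states
        self.u = initial
        self.delta_u = delta_u   # (u, frozenset of true props) -> next state
        self.delta_r = delta_r   # (u, frozenset of true props) -> reward

    def step(self, true_props):
        """Advance on the propositions made true by the environment step."""
        key = (self.u, frozenset(true_props))
        reward = self.delta_r.get(key, 0.0)
        self.u = self.delta_u.get(key, self.u)  # stay put on unspecified labels
        return reward

# Example (illustrative): reward 1 for reaching the office *after* getting coffee,
# a non-Markovian task that depends on the history of events, not just the state.
rm = RewardMachine(
    states={"u0", "u1"},
    initial="u0",
    delta_u={("u0", frozenset({"coffee"})): "u1",
             ("u1", frozenset({"office"})): "u0"},
    delta_r={("u1", frozenset({"office"})): 1.0},
)
print(rm.step({"coffee"}), rm.step({"office"}))  # 0.0 1.0
```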
arXiv Detail & Related papers (2020-10-06T00:10:16Z) - Online Learning of Non-Markovian Reward Models [2.064612766965483]
We consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves, together with a non-Markovian reward function.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learned.
We use Angluin's L* active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine.
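For context, Angluin's L* learns from a teacher that answers membership queries (the output word produced by a given input word) and equivalence queries (a counterexample to a hypothesis machine, if one exists); the preference queries of REMAP above replace exactly the membership half of this interface. The sketch below shows that teacher interface for a Mealy-machine reward model; the `MealyTeacher` class, the brute-force equivalence check, and the two-state target are illustrative assumptions.

```python
# A hedged sketch of the teacher interface an L*-style learner needs when
# learning a Mealy machine reward model.
from itertools import product

class MealyTeacher:
    def __init__(self, transition, output, initial):
        self.transition, self.output, self.initial = transition, output, initial

    def membership(self, word):
        """Output (reward) sequence produced by running `word` through the target."""
        state, outputs = self.initial, []
        for symbol in word:
            outputs.append(self.output[(state, symbol)])
            state = self.transition[(state, symbol)]
        return tuple(outputs)

    def equivalence(self, hypothesis, alphabet, max_len=4):
        """Brute-force check on short words; returns a counterexample or None."""
        for n in range(1, max_len + 1):
            for word in product(alphabet, repeat=n):
                if hypothesis(word) != self.membership(word):
                    return word
        return None

# Illustrative target: reward 1 only on a "b" taken after an "a".
teacher = MealyTeacher(
    transition={("q0", "a"): "q1", ("q0", "b"): "q0",
                ("q1", "a"): "q1", ("q1", "b"): "q0"},
    output={("q0", "a"): 0, ("q0", "b"): 0,
            ("q1", "a"): 0, ("q1", "b"): 1},
    initial="q0",
)
always_zero = lambda word: tuple(0 for _ in word)    # naive first hypothesis
print(teacher.membership(("a", "b")))                # (0, 1)
print(teacher.equivalence(always_zero, ("a", "b")))  # counterexample: ('a', 'b')
```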
arXiv Detail & Related papers (2020-09-26T13:54:34Z) - Learning Non-Markovian Reward Models in MDPs [0.0]
We show how to formalise the non-Markovian reward function using a Mealy machine.
In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learnt.
arXiv Detail & Related papers (2020-01-25T10:51:42Z)