Inferring Reward Machines and Transition Machines from Partially Observable Markov Decision Processes
- URL: http://arxiv.org/abs/2508.01947v1
- Date: Sun, 03 Aug 2025 22:53:25 GMT
- Title: Inferring Reward Machines and Transition Machines from Partially Observable Markov Decision Processes
- Authors: Yuly Wu, Jiamou Liu, Libo Zhang
- Abstract summary: Inferring an automaton to handle the non-Markovianity is a proven effective approach, but faces two limitations. To develop a unified inference algorithm for both automata types, we propose the Dual Behavior Mealy Machine (DBMM) that subsumes both TMs and RMs. We then introduce DB-RPNI, a passive automata learning algorithm that efficiently infers DBMMs while avoiding the costly reductions required by prior work.
- Score: 13.173882985068218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Partially Observable Markov Decision Processes (POMDPs) are fundamental to many real-world applications. Although reinforcement learning (RL) has shown success in fully observable domains, learning policies from traces in partially observable environments remains challenging due to non-Markovian observations. Inferring an automaton to handle the non-Markovianity is a proven effective approach, but faces two limitations: 1) existing automaton representations focus only on reward-based non-Markovianity, leading to unnatural problem formulations; 2) inference algorithms face enormous computational costs. For the first limitation, we introduce Transition Machines (TMs) to complement existing Reward Machines (RMs). To develop a unified inference algorithm for both automata types, we propose the Dual Behavior Mealy Machine (DBMM) that subsumes both TMs and RMs. We then introduce DB-RPNI, a passive automata learning algorithm that efficiently infers DBMMs while avoiding the costly reductions required by prior work. We further develop optimization techniques and identify sufficient conditions for inferring the minimal correct automata. Experimentally, our inference method achieves speedups of up to three orders of magnitude over SOTA baselines.
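The abstract names the Dual Behavior Mealy Machine (DBMM) but does not define it here. As a rough illustration of how a single Mealy machine can carry both reward-based (RM) and transition-based (TM) behavior, here is a minimal Python sketch; the class name, the `(reward, label)` output convention, and the key/door task are illustrative assumptions, not the paper's definitions.

```python
# Sketch of a Mealy machine whose outputs pair an RM-style reward with a
# TM-style transition label. Illustrative only; the paper's actual DBMM
# definition may differ.

class DualBehaviorMealyMachine:
    def __init__(self, states, initial, delta, outputs):
        self.states = states      # finite set of automaton states
        self.initial = initial    # initial state
        self.delta = delta        # (state, symbol) -> next state
        self.outputs = outputs    # (state, symbol) -> (reward, label)

    def run(self, word):
        """Consume a sequence of observation symbols, emitting one
        (reward, label) pair per symbol, Mealy-style."""
        state, emitted = self.initial, []
        for symbol in word:
            emitted.append(self.outputs[(state, symbol)])
            state = self.delta[(state, symbol)]
        return state, emitted


# Toy non-Markovian task: reward 1 only when 'door' follows a seen 'key'.
delta = {
    ("q0", "key"): "q1", ("q0", "door"): "q0",
    ("q1", "key"): "q1", ("q1", "door"): "q2",
    ("q2", "key"): "q2", ("q2", "door"): "q2",
}
outputs = {
    (s, a): (1 if (s, a) == ("q1", "door") else 0,  # RM-style reward
             delta[(s, a)])                          # TM-style label
    for (s, a) in delta
}

m = DualBehaviorMealyMachine({"q0", "q1", "q2"}, "q0", delta, outputs)
final, trace = m.run(["door", "key", "door"])
print(final, trace)  # q2 [(0, 'q0'), (0, 'q1'), (1, 'q2')]
```

Projecting the output pairs onto their first component yields reward-machine behavior, and onto their second yields transition-machine behavior, which is one plausible reading of "subsumes both TMs and RMs". A companion sketch of the prefix-tree construction that RPNI-family passive learners start from appears after the related-papers list below.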
Related papers
- Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning [10.227089771963943]
We propose an uncertainty-driven framework for automated process reward data construction. We introduce two generic uncertainty-aware output aggregation methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote. Experiments on ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the proposed PRM data construction framework.
arXiv Detail & Related papers (2025-08-03T14:14:13Z)
- Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By utilizing data subsets during the evaluation process, we address the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z)
- Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels [5.868949328814509]
Model quantization enables efficient deployment of deep neural networks on edge devices through low-bit parameter representation. Existing machine unlearning (MU) methods fail to address two fundamental limitations in quantized networks. We propose Q-MUL, the first dedicated unlearning framework for quantized models.
arXiv Detail & Related papers (2025-03-18T05:22:13Z)
- DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving [62.62464518137153]
DriveTransformer is a simplified E2E-AD framework designed for ease of scaling up. It is composed of three unified operations: task self-attention, sensor cross-attention, and temporal cross-attention. It achieves state-of-the-art performance on both the simulated closed-loop benchmark Bench2Drive and the real-world open-loop benchmark nuScenes with high FPS.
arXiv Detail & Related papers (2025-03-07T11:41:18Z)
- Coarse-to-Fine Process Reward Modeling for Mathematical Reasoning [11.15613673478208]
The Process Reward Model (PRM) plays a crucial role in mathematical reasoning tasks, requiring high-quality supervised process data. We observe that reasoning steps generated by Large Language Models (LLMs) often fail to exhibit strictly incremental information, leading to redundancy. We propose CFPRM, a simple yet effective coarse-to-fine strategy for detecting redundant steps.
arXiv Detail & Related papers (2025-01-23T12:44:45Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the computational cost of the LLM by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- Tractable Offline Learning of Regular Decision Processes [50.11277112628193]
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs).
In RDPs, the unknown dependency of future observations and rewards on past interactions can be captured by a hidden finite-state automaton.
Many algorithms first reconstruct this unknown dependency using automata learning techniques.
arXiv Detail & Related papers (2024-09-04T14:26:58Z)
- Unsupervised Automata Learning via Discrete Optimization [7.06671668667062]
We propose a framework for learning a deterministic finite automaton (DFA) from a given multi-set of unlabeled words. We show that this problem is computationally hard and develop three learning algorithms based on constraint optimization. We introduce novel regularization schemes for our optimization problems that improve the overall interpretability of our DFAs.
arXiv Detail & Related papers (2023-03-24T16:19:15Z)
- Online Learning under Budget and ROI Constraints via Weak Adaptivity [57.097119428915796]
Existing primal-dual algorithms for constrained online learning problems rely on two fundamental assumptions.
We show how such assumptions can be circumvented by endowing standard primal-dual templates with weakly adaptive regret minimizers.
We prove the first best-of-both-worlds no-regret guarantees which hold in the absence of the two aforementioned assumptions.
arXiv Detail & Related papers (2023-02-02T16:30:33Z)
- Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning [114.36124979578896]
We design a dynamic mechanism using offline reinforcement learning algorithms.
Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set.
arXiv Detail & Related papers (2022-05-05T05:44:26Z)
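Several papers above, like the DB-RPNI algorithm in the main abstract, build on passive automata learning by state merging. The sketch below shows its first two ingredients under an assumed trace format of paired input/output sequences: a prefix-tree transducer and a pairwise compatibility test. A full RPNI-style learner would additionally fold merges to keep the machine deterministic, which this sketch omits.

```python
# Prefix-tree construction and state-compatibility test, the starting
# point of RPNI-style passive learning. The trace format is an assumption:
# (inputs, outputs) with outputs[i] emitted on reading inputs[i].
from collections import defaultdict

def build_prefix_tree(traces):
    """Prefix-tree transducer: each state is an observed input prefix."""
    trans, out = defaultdict(dict), {}
    for inputs, outputs in traces:
        state = ()
        for a, o in zip(inputs, outputs):
            if out.get((state, a), o) != o:
                raise ValueError("conflicting traces: not Mealy-representable")
            trans[state][a] = state + (a,)
            out[(state, a)] = o
            state = state + (a,)
    return trans, out

def compatible(s, t, trans, out):
    """True if the observed behaviors below s and t never disagree,
    i.e. merging the two states is consistent with the data."""
    for a in set(trans[s]) & set(trans[t]):
        if out[(s, a)] != out[(t, a)]:
            return False
        if not compatible(trans[s][a], trans[t][a], trans, out):
            return False
    return True

# Two traces from the key/door toy task sketched earlier:
traces = [
    (["door", "key", "door"], [0, 0, 1]),
    (["key", "door"],         [0, 1]),
]
trans, out = build_prefix_tree(traces)
print(compatible((), ("door",), trans, out))  # True: 'door' self-loops at the start
```

Greedily merging compatible states (with folding) is what collapses the prefix tree into a small automaton; the abstract's "sufficient conditions for inferring the minimal correct automata" plausibly concern when such merging recovers the minimal machine.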
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.