On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration
- URL: http://arxiv.org/abs/2501.12785v2
- Date: Tue, 21 Oct 2025 14:03:30 GMT
- Title: On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration
- Authors: Yirui Zhou, Yunfei Jin, Xiaowei Liu, Xiaofeng Zhang, Yangchun Zhang,
- Abstract summary: Learning from observations (LfO) replicates expert behavior without needing access to the expert's actions. However, directly applying the on-policy training scheme in LfO worsens the sample inefficiency problem. This paper seeks to develop an efficient and stable solution for the LfO problem.
- Score: 5.82641114510464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning from observations (LfO) replicates expert behavior without needing access to the expert's actions, making it more practical than learning from demonstrations (LfD) in many real-world scenarios. However, directly applying the on-policy training scheme in LfO worsens the sample inefficiency problem, while employing the traditional off-policy training scheme in LfO magnifies the instability issue. This paper seeks to develop an efficient and stable solution for the LfO problem. Specifically, we begin by exploring the generalization capabilities of both the reward function and policy in LfO, which provides a theoretical foundation for computation. Building on this, we modify the policy optimization method in generative adversarial imitation from observation (GAIfO) with distributional soft actor-critic (DSAC), and propose the Mimicking Observations through Distributional Update Learning with adequate Exploration (MODULE) algorithm to solve the LfO problem. MODULE incorporates the advantages of (1) high sample efficiency and training robustness enhancement in soft actor-critic (SAC), and (2) training stability in distributional reinforcement learning (RL). Extensive experiments in MuJoCo environments showcase the superior performance of MODULE over current LfO methods.
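The abstract describes the algorithmic recipe at a high level: a GAIfO-style discriminator over state transitions supplies a surrogate reward, while the policy is trained with a distributional soft actor-critic (DSAC) update. The snippet below is a minimal, illustrative PyTorch sketch of how those two pieces might be wired together; all names (`module_update`, `gaifo_reward`), network sizes, hyperparameters, and the quantile-based critic are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of a MODULE-style update: a GAIfO discriminator reward over
# (s, s') pairs plus a quantile (distributional) soft actor-critic step.
# Illustrative only; termination handling, target networks, and the
# discriminator's own training loop are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, N_QUANTILES = 8, 2, 32
GAMMA, ALPHA = 0.99, 0.2  # discount factor and entropy temperature

disc = nn.Sequential(nn.Linear(2 * STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, N_QUANTILES))  # quantiles of the return
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 2 * ACTION_DIM))  # mean and log-std

def sample_action(state):
    """Squashed-Gaussian policy: returns an action and its log-probability."""
    mean, log_std = actor(state).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
    pre_tanh = dist.rsample()
    action = torch.tanh(pre_tanh)
    log_prob = (dist.log_prob(pre_tanh)
                - torch.log(1 - action.pow(2) + 1e-6)).sum(-1, keepdim=True)
    return action, log_prob

def gaifo_reward(s, s_next):
    """Surrogate reward from the discriminator over state transitions only."""
    return -F.logsigmoid(-disc(torch.cat([s, s_next], dim=-1)))

def quantile_huber_loss(pred, target, taus):
    """Pairwise quantile-regression Huber loss used by distributional critics."""
    td = target.unsqueeze(1) - pred.unsqueeze(2)              # (B, N, N)
    huber = F.smooth_l1_loss(pred.unsqueeze(2).expand_as(td),
                             target.unsqueeze(1).expand_as(td),
                             reduction="none")
    weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()
    return (weight * huber).mean()

def module_update(s, a, s_next, taus):
    """One critic/actor update on a replayed batch (use separate optimizers
    in practice so the actor loss does not also update the critic)."""
    with torch.no_grad():
        r = gaifo_reward(s, s_next)
        a_next, logp_next = sample_action(s_next)
        z_next = critic(torch.cat([s_next, a_next], dim=-1))
        target = r + GAMMA * (z_next - ALPHA * logp_next)     # soft distributional target
    z_pred = critic(torch.cat([s, a], dim=-1))
    critic_loss = quantile_huber_loss(z_pred, target, taus)

    a_new, logp = sample_action(s)
    q_new = critic(torch.cat([s, a_new], dim=-1)).mean(-1, keepdim=True)
    actor_loss = (ALPHA * logp - q_new).mean()                # SAC-style policy objective
    return critic_loss, actor_loss

if __name__ == "__main__":
    # Example call on a random replay batch; taus are the quantile midpoints.
    B = 16
    taus = (torch.arange(N_QUANTILES, dtype=torch.float32) + 0.5) / N_QUANTILES
    s, s_next = torch.randn(B, STATE_DIM), torch.randn(B, STATE_DIM)
    a = torch.tanh(torch.randn(B, ACTION_DIM))
    c_loss, a_loss = module_update(s, a, s_next, taus)
    print(c_loss.item(), a_loss.item())
```

The sketch keeps the two ingredients the abstract highlights visible in one place: the entropy-regularized (SAC-style) objective for sample efficiency, and a return distribution (here via quantile regression) rather than a scalar Q-value for training stability.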
Related papers
- CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs [31.371566320424552]
CoBA-RL is a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model's evolving capability. Our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements.
arXiv Detail & Related papers (2026-02-03T03:14:36Z) - Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z) - CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z) - WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training [64.0932926819307]
We present Warmup-Stable and Merge (WSM), a framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies. Our framework consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks.
arXiv Detail & Related papers (2025-07-23T16:02:06Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning [55.36978389831446]
We recast reflective exploration within the Bayes-Adaptive RL framework. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on observed outcomes.
arXiv Detail & Related papers (2025-05-26T22:51:00Z) - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format, and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning [6.691759477350243]
Large language models (LLMs) trained with Reinforcement Learning from Human Feedback have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque.
This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions.
We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 85% accuracy in predicting human preferences.
arXiv Detail & Related papers (2024-10-16T12:14:25Z) - Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL). We identify the Nash equilibrium from a preference-only offline dataset in general-sum games. Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z) - Preference-Guided Reinforcement Learning for Efficient Exploration [14.058764537783086]
We introduce LOPE (Learning Online with trajectory Preference guidancE), an end-to-end preference-guided RL framework. Our intuition is that LOPE directly adjusts the focus of online exploration by treating human feedback as guidance. LOPE outperforms several state-of-the-art methods in terms of convergence rate and overall performance.
arXiv Detail & Related papers (2024-07-09T02:11:12Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Generalizing Reward Modeling for Out-of-Distribution Preference Learning [3.9160947065896803]
Preference learning with large language models (LLMs) aims to align the LLMs' generations with human preferences.
Due to the difficulty of obtaining human feedback, discretely training reward models for every encountered distribution is challenging.
This work addresses OOD PL by optimizing a general reward model through a meta-learning approach.
arXiv Detail & Related papers (2024-02-22T18:20:33Z) - Assessing the Impact of Distribution Shift on Reinforcement Learning Performance [0.0]
Reinforcement learning (RL) faces its own set of unique challenges.
Comparison of point estimates, and plots that show successful convergence to the optimal policy during training, may obfuscate overfitting or dependence on the experimental setup.
We propose a set of evaluation methods that measure the robustness of RL algorithms under distribution shifts.
arXiv Detail & Related papers (2024-02-05T23:50:55Z) - Noise Distribution Decomposition based Multi-Agent Distributional Reinforcement Learning [15.82785057592436]
Multi-Agent Reinforcement Learning (MARL) is more susceptible to noise due to the interference among intelligent agents.
We propose a novel decomposition-based multi-agent distributional RL method by approximating the globally shared noisy reward.
We also verify the effectiveness of the proposed method through extensive simulation experiments with noisy rewards.
arXiv Detail & Related papers (2023-12-12T07:24:15Z) - Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z) - Inverse Reinforcement Learning for Text Summarization [52.765898203824975]
We introduce inverse reinforcement learning (IRL) as an effective paradigm for training abstractive summarization models.
Experimental results across datasets in different domains demonstrate the superiority of our proposed IRL model for summarization over MLE and RL baselines.
arXiv Detail & Related papers (2022-12-19T23:45:05Z) - Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z) - Exploration with Multi-Sample Target Values for Distributional Reinforcement Learning [20.680417111485305]
We introduce multi-sample target values (MTV) for distributional RL, as a principled replacement for single-sample target value estimation.
The improved distributional estimates lend themselves to UCB-based exploration.
We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult tasks such as Humanoid control.
arXiv Detail & Related papers (2022-02-06T03:27:05Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z) - Bayesian Distributional Policy Gradients [2.28438857884398]
Distributional Reinforcement Learning maintains the entire probability distribution of the reward-to-go, i.e. the return.
Bayesian Distributional Policy Gradients (BDPG) uses adversarial training in joint-contrastive learning to estimate a variational posterior from the returns.
arXiv Detail & Related papers (2021-03-20T23:42:50Z) - Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art methods on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z) - Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)