Reward-Aware Proto-Representations in Reinforcement Learning
- URL: http://arxiv.org/abs/2505.16217v1
- Date: Thu, 22 May 2025 04:33:00 GMT
- Title: Reward-Aware Proto-Representations in Reinforcement Learning
- Authors: Hon Tik Tse, Siddarth Chandrasekar, Marlos C. Machado
- Abstract summary: In recent years, the successor representation (SR) has attracted increasing attention in reinforcement learning (RL). In this paper, we discuss a similar representation that also takes into account the reward dynamics of the problem. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.
- Score: 6.855996110012974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the successor representation (SR) has attracted increasing attention in reinforcement learning (RL), and it has been used to address some of its key challenges, such as exploration, credit assignment, and generalization. The SR can be seen as representing the underlying credit assignment structure of the environment by implicitly encoding its induced transition dynamics. However, the SR is reward-agnostic. In this paper, we discuss a similar representation that also takes into account the reward dynamics of the problem. We study the default representation (DR), a recently proposed representation with limited theoretical (and empirical) analysis. Here, we lay some of the theoretical foundation underlying the DR in the tabular case by (1) deriving dynamic programming and (2) temporal-difference methods to learn the DR, (3) characterizing the basis for the vector space of the DR, and (4) formally extending the DR to the function approximation case through default features. Empirically, we analyze the benefits of the DR in many of the settings in which the SR has been applied, including (1) reward shaping, (2) option discovery, (3) exploration, and (4) transfer learning. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.
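To make the abstract's starting point concrete, here is a minimal sketch of the tabular SR, which is standard background rather than the paper's DR algorithms. It assumes a fixed policy with state-transition matrix P and discount gamma, so that the SR satisfies Psi = I + gamma * P * Psi. The function names (`sr_dynamic_programming`, `sr_td_update`, `values_from_sr`) and the 3-state cycle are hypothetical, chosen only for illustration. The example also shows why the SR is called reward-agnostic: the same Psi is reused for two different reward vectors.

```python
# Illustrative sketch only: tabular successor representation (SR).
# This is NOT the paper's default representation (DR) method; all names
# and the toy environment below are assumptions made for illustration.

def identity(n):
    """Return the n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def sr_dynamic_programming(P, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Iterate Psi <- I + gamma * P @ Psi until the largest change is below tol."""
    n = len(P)
    psi = identity(n)
    for _ in range(max_iters):
        new_psi = [
            [(1.0 if i == j else 0.0)
             + gamma * sum(P[i][k] * psi[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)
        ]
        delta = max(abs(new_psi[i][j] - psi[i][j])
                    for i in range(n) for j in range(n))
        psi = new_psi
        if delta < tol:
            break
    return psi

def sr_td_update(psi, s, s_next, gamma=0.9, alpha=0.1):
    """Apply one TD(0) update to row s of psi after observing s -> s_next."""
    for j in range(len(psi)):
        target = (1.0 if s == j else 0.0) + gamma * psi[s_next][j]
        psi[s][j] += alpha * (target - psi[s][j])

def values_from_sr(psi, r):
    """Compute state values v = Psi @ r for a reward vector r."""
    return [sum(p * rj for p, rj in zip(row, r)) for row in psi]

# Toy environment: a deterministic 3-state cycle 0 -> 1 -> 2 -> 0.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

psi = sr_dynamic_programming(P, gamma=0.9)

# The SR does not depend on rewards: one Psi serves both reward vectors.
v_a = values_from_sr(psi, [0.0, 0.0, 1.0])
v_b = values_from_sr(psi, [1.0, 0.0, 0.0])
```

Because Psi is computed from P alone, changing the reward only changes the final product with r. A reward-aware representation such as the DR, by contrast, would itself change when the rewards change.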
Related papers
- Explaining the Explainer: Understanding the Inner Workings of Transformer-based Symbolic Regression Models [3.7957452405531265]
We introduce PATCHES, an evolutionary circuit discovery algorithm that identifies compact and correct circuits for symbolic regression. Using PATCHES, we isolate 28 circuits, providing the first circuit-level characterisation of an SR transformer.
arXiv Detail & Related papers (2026-02-03T13:27:10Z) - Implicit Neural Representation-Based Continuous Single Image Super Resolution: An Empirical Study [50.15623093332659]
Implicit neural representation (INR) has become the standard approach for arbitrary-scale image super-resolution (ASSR). We compare existing techniques across diverse settings and present aggregated performance results on multiple image quality metrics. We examine a new loss function that penalizes intensity variations while preserving edges, textures, and finer details during training.
arXiv Detail & Related papers (2026-01-25T07:09:20Z) - Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR [67.66592867046229]
Character-R1 is a framework designed to provide verifiable reward signals for effective role-aware reasoning. Our framework comprises three core designs: Cognitive Focus Reward, Reference-Guided Reward and Character-Conditioned Reward Normalization.
arXiv Detail & Related papers (2026-01-08T05:33:37Z) - Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification [35.41216970580546]
Trade-R1, a model training framework, bridges verifiable rewards to environments via process-level reasoning verification. We construct a triangular consistency metric, assessing pairwise alignment between retrieved evidence, reasoning chains, and decisions. Experiments on different country asset selection demonstrate that our paradigm reduces reward hacking.
arXiv Detail & Related papers (2026-01-07T14:03:22Z) - PACR: Progressively Ascending Confidence Reward for LLM Reasoning [55.06373646059141]
We propose Progressively Ascending Confidence Reward (PACR). PACR is a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
arXiv Detail & Related papers (2025-10-25T11:25:35Z) - Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z) - CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. For the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses.
arXiv Detail & Related papers (2025-09-11T17:59:17Z) - CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards [53.36917093757101]
Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). We introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment.
arXiv Detail & Related papers (2025-07-23T02:26:33Z) - Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks [6.881699020319577]
We propose Direct Reasoning Optimization (DRO), a reinforcement learning framework for fine-tuning Large Language Models (LLMs). DRO is guided by a new reward signal: the Reasoning Reflection Reward (R3). DRO consistently outperforms strong baselines while remaining broadly applicable across both open-ended and structured domains.
arXiv Detail & Related papers (2025-06-16T10:43:38Z) - RE-TRIP : Reflectivity Instance Augmented Triangle Descriptor for 3D Place Recognition [14.095215136905553]
We propose a novel descriptor for 3D Place Recognition, named RE-TRIP. This new descriptor leverages both geometric measurements and reflectivity to enhance robustness. We conduct a series of experiments to demonstrate the effectiveness of RE-TRIP.
arXiv Detail & Related papers (2025-05-22T03:11:30Z) - Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z) - Reward Models Identify Consistency, Not Causality [54.987590763737145]
State-of-the-art reward models prioritize structural consistency over causal correctness. Removing the problem statement has minimal impact on reward scores. Altering numerical values or disrupting the reasoning flow significantly affects RM outputs.
arXiv Detail & Related papers (2025-02-20T14:57:14Z) - Out-of-Domain Generalization in Dynamical Systems Reconstruction [8.397468572544614]
We provide a formal framework that addresses generalization in DSR.
We show that black-box DL techniques, without adequate structural priors, generally will not be able to learn a generalizing DSR model.
arXiv Detail & Related papers (2024-02-28T14:52:58Z) - Explainable Session-based Recommendation via Path Reasoning [27.205463326317656]
We propose a hierarchical reinforcement learning framework for SR, which improves the explainability of existing SR models via Path Reasoning, namely PR4SR.
Considering the different importance of items to the session, we design the session-level agent to select the items in the session as the starting point for path reasoning and the path-level agent to perform path reasoning.
In particular, we design a multi-target reward mechanism to adapt to the skip behaviors of sequential patterns in SR, and introduce path midpoint reward to enhance the exploration efficiency in knowledge graphs.
arXiv Detail & Related papers (2024-02-28T12:11:08Z) - Single-Reset Divide & Conquer Imitation Learning [49.87201678501027]
Demonstrations are commonly used to speed up the learning process of Deep Reinforcement Learning algorithms.
Some algorithms have been developed to learn from a single demonstration.
arXiv Detail & Related papers (2024-02-14T17:59:47Z) - Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z) - ASR: Attention-alike Structural Re-parameterization [53.019657810468026]
We propose a simple-yet-effective attention-alike structural re-parameterization (ASR) that allows us to achieve SRP for a given network while enjoying the effectiveness of the attention mechanism.
In this paper, we conduct extensive experiments from a statistical perspective and discover an interesting phenomenon, Stripe Observation, which reveals that channel attention values quickly approach some constant vectors during training.
arXiv Detail & Related papers (2023-04-13T08:52:34Z)