Balancing Interpretability and Performance in Reinforcement Learning: An Adaptive Spectral Based Linear Approach
- URL: http://arxiv.org/abs/2510.03722v1
- Date: Sat, 04 Oct 2025 07:53:43 GMT
- Title: Balancing Interpretability and Performance in Reinforcement Learning: An Adaptive Spectral Based Linear Approach
- Authors: Qianxin Yi, Shao-Bo Lin, Jun Fan, Yao Wang
- Abstract summary: Reinforcement learning (RL) has been widely applied to sequential decision making. Current approaches typically focus on performance and rely on post hoc explanations to account for interpretability. We propose a spectral based linear RL method that extends the ridge regression-based approach through a spectral filter function.
- Score: 15.065437093352054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) has been widely applied to sequential decision making, where interpretability and performance are both critical for practical adoption. Current approaches typically focus on performance and rely on post hoc explanations to account for interpretability. Different from these approaches, we focus on designing an interpretability-oriented yet performance-enhanced RL approach. Specifically, we propose a spectral based linear RL method that extends the ridge regression-based approach through a spectral filter function. The proposed method clarifies the role of regularization in controlling estimation error and further enables the design of an adaptive regularization parameter selection strategy guided by the bias-variance trade-off principle. Theoretical analysis establishes near-optimal bounds for both parameter estimation and generalization error. Extensive experiments on simulated environments and real-world datasets from Kuaishou and Taobao demonstrate that our method either outperforms or matches existing baselines in decision quality. We also conduct interpretability analyses to illustrate how the learned policies make decisions, thereby enhancing user trust. These results highlight the potential of our approach to bridge the gap between RL theory and practical decision making, providing interpretability, accuracy, and adaptability in management contexts.
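The abstract describes extending ridge regression with a spectral filter function. A minimal sketch of that idea (not the paper's implementation; all names and the toy data are illustrative) is that the ridge solution applies the filter g(σ) = 1/(σ + λ) to the eigenvalues of the feature Gram matrix, and other filters, such as spectral cut-off, slot into the same estimator:

```python
import numpy as np

def spectral_estimator(Phi, y, filter_fn):
    """Linear estimate w = V g(s^2) diag(s) U^T y, where Phi = U diag(s) V^T."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    g = filter_fn(s**2)          # spectral filter applied to eigenvalues of Phi^T Phi
    return Vt.T @ (g * s * (U.T @ y))

def ridge_filter(lam):
    # Ridge regression corresponds to g(sigma) = 1 / (sigma + lam).
    return lambda sigma: 1.0 / (sigma + lam)

def cutoff_filter(lam):
    # Spectral cut-off: invert components above the threshold, discard the rest.
    return lambda sigma: np.where(sigma >= lam, 1.0 / sigma, 0.0)

# Toy regression problem.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = Phi @ w_true + 0.1 * rng.normal(size=200)

lam = 1e-2
w_ridge = spectral_estimator(Phi, y, ridge_filter(lam))
w_cutoff = spectral_estimator(Phi, y, cutoff_filter(lam))

# Sanity check: the ridge filter reproduces the closed-form ridge solution
# (Phi^T Phi + lam I)^{-1} Phi^T y.
w_closed = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ y)
print(np.allclose(w_ridge, w_closed))  # True
```

The filter's shape controls how aggressively small spectral components are damped, which is the knob the paper's adaptive regularization parameter selection would tune against the bias-variance trade-off.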
Related papers
- Towards regularized learning from functional data with covariate shift [3.072411352294816]
This paper investigates a general regularization framework for unsupervised domain adaptation in vector-valued regression. By restricting the hypothesis space, we develop a practical operator learning algorithm capable of handling functional outputs.
arXiv Detail & Related papers (2026-01-28T20:30:05Z) - A Comedy of Estimators: On KL Regularization in RL Training of LLMs [81.7906270099878]
Reinforcement learning (RL) can substantially improve the reasoning performance of large language models (LLMs). The RL objective for LLM training involves a regularization term, the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for the stated objectives, creating a discrepancy between the objective and its implementation. We study the gradients of several estimator configurations, revealing how design choices shape gradient bias.
arXiv Detail & Related papers (2025-12-26T04:20:58Z) - OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning [12.77713716713937]
We provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators. We derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of gradients. We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction.
arXiv Detail & Related papers (2025-11-28T16:09:28Z) - Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z) - Policy Regularized Distributionally Robust Markov Decision Processes with Linear Function Approximation [10.35045003737115]
Decision-making under distribution shift is a central challenge in reinforcement learning (RL), where training and deployment environments differ. We propose DR-RPO, a model-free online policy optimization method that learns robust policies with sublinear regret. We show that DR-RPO can achieve suboptimality bounds and sample efficiency in robust RL, matching the performance of value-based approaches.
arXiv Detail & Related papers (2025-10-16T02:56:58Z) - Observations Meet Actions: Learning Control-Sufficient Representations for Robust Policy Generalization [6.408943565801689]
Capturing latent variations ("contexts") is key to deploying reinforcement learning (RL) agents beyond their training regime. We recast context-based RL as a dual inference-control problem and formally characterize two properties and their hierarchy. We derive a contextual evidence lower bound (ELBO)-style objective that cleanly separates representation learning from policy learning.
arXiv Detail & Related papers (2025-07-25T17:08:16Z) - Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). We present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle. We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z) - Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning [8.087699764574788]
We propose an efficient algorithm for offline preference-based reinforcement learning (PbRL). APPO guarantees sample complexity bounds without relying on explicit confidence sets. To our knowledge, APPO is the first offline PbRL algorithm to offer both statistical efficiency and practical applicability.
arXiv Detail & Related papers (2025-03-07T10:35:01Z) - Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z) - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods like offline PPO and offline DPO as a lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z) - Probabilistic Constrained Reinforcement Learning with Formal Interpretability [2.990411348977783]
We propose a novel Adaptive Wasserstein Variational Optimization, namely AWaVO, to tackle these interpretability challenges.
Our approach uses formal methods to achieve the interpretability for convergence guarantee, training transparency, and intrinsic decision-interpretation.
In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO, and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.
arXiv Detail & Related papers (2023-07-13T22:52:22Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.