Mitigating Estimation Bias with Representation Learning in TD Error-Driven Regularization
- URL: http://arxiv.org/abs/2511.16090v1
- Date: Thu, 20 Nov 2025 06:31:55 GMT
- Title: Mitigating Estimation Bias with Representation Learning in TD Error-Driven Regularization
- Authors: Haohui Chen, Zhiyong Chen, Aoxiang Liu, Wentuo Fang
- Abstract summary: This work introduces enhanced methods to achieve flexible bias control and stronger representation learning. We propose three convex combination strategies, symmetric and asymmetric, that balance pessimistic estimates to mitigate overestimation and optimistic exploration via double actors. To further improve performance, we integrate augmented state and action representations into the actor and critic networks.
- Score: 4.784045060345404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deterministic policy gradient algorithms for continuous control suffer from value estimation biases that degrade performance. While double critics reduce such biases, the exploration potential of double actors remains underexplored. Building on temporal-difference error-driven regularization (TDDR), a double actor-critic framework, this work introduces enhanced methods to achieve flexible bias control and stronger representation learning. We propose three convex combination strategies, symmetric and asymmetric, that balance pessimistic estimates to mitigate overestimation and optimistic exploration via double actors to alleviate underestimation. A single hyperparameter governs this mechanism, enabling tunable control across the bias spectrum. To further improve performance, we integrate augmented state and action representations into the actor and critic networks. Extensive experiments show that our approach consistently outperforms benchmarks, demonstrating the value of tunable bias and revealing that both overestimation and underestimation can be exploited differently depending on the environment.
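The abstract describes a single hyperparameter that interpolates between pessimistic and optimistic value estimates but does not spell out the combination rule. A minimal sketch of how such a convex combination could work (the function name, argument shapes, and exact form of the rule are illustrative assumptions, not the paper's stated method):

```python
import numpy as np

def convex_target(q1, q2, beta):
    """Interpolate between pessimistic and optimistic value estimates.

    beta = 1.0 recovers the pessimistic TD3-style minimum over double
    critics (curbs overestimation); beta = 0.0 recovers the optimistic
    maximum, e.g. over values induced by double actors (curbs
    underestimation). Intermediate beta tunes across the bias spectrum.
    """
    pessimistic = np.minimum(q1, q2)  # lower bound of the two estimates
    optimistic = np.maximum(q1, q2)   # upper bound of the two estimates
    return beta * pessimistic + (1.0 - beta) * optimistic
```

A grid search over `beta` per environment would then expose whether that environment benefits more from pessimism or optimism, which is consistent with the abstract's finding that both biases can be exploited differently depending on the task.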
Related papers
- Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty [0.0]
Off-policy actor-critic methods in reinforcement learning train a critic with temporal-difference updates and use it as a learning signal for the policy (actor). Current methods employ ensembling to quantify the critic's epistemic uncertainty (uncertainty due to limited data and model ambiguity) to scale pessimistic updates. In this work, we propose a new algorithm called Stochastic Actor-Critic (STAC) that incorporates temporal aleatoric uncertainty (uncertainty arising from transitions, rewards, and policy-induced variability in Bellman targets).
arXiv Detail & Related papers (2026-01-02T16:33:17Z) - Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z) - RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation [67.38036090822982]
We propose RoboView-Bias, the first benchmark specifically designed to quantify visual bias in robotic manipulation. We create 2,127 task instances that enable robust measurement of biases induced by individual visual factors and their interactions. Our results highlight that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.
arXiv Detail & Related papers (2025-09-26T13:53:25Z) - Counterfactual Reward Model Training for Bias Mitigation in Multimodal Reinforcement Learning [0.5204229323525671]
We present a counterfactual reward model that introduces causal inference with multimodal representation learning to provide an unsupervised, bias-resilient reward signal. We evaluated the framework on a multimodal fake versus true news dataset, which exhibits framing bias, class imbalance, and distributional drift. The resulting system achieved an accuracy of 89.12% in fake news detection, outperforming the baseline reward models.
arXiv Detail & Related papers (2025-08-27T04:54:33Z) - Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning [53.25336975467293]
We present the first theoretical error decomposition analysis of methods such as perplexity and self-consistency. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function. We propose Reasoning-Pruning Perplexity Consistency (RPC), which integrates perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths.
arXiv Detail & Related papers (2025-02-01T18:09:49Z) - Spectral Representation for Causal Estimation with Hidden Confounders [33.148766692274215]
We address the problem of causal effect estimation where hidden confounders are present. Our approach uses a singular value decomposition of a conditional expectation operator, followed by a saddle-point optimization problem.
arXiv Detail & Related papers (2024-07-15T05:39:56Z) - Perturbation-Invariant Adversarial Training for Neural Ranking Models:
Improving the Effectiveness-Robustness Trade-Off [107.35833747750446]
Adversarial examples can be crafted by adding imperceptible perturbations to legitimate documents.
This vulnerability raises significant concerns about their reliability and hinders the widespread deployment of NRMs.
In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs.
arXiv Detail & Related papers (2023-12-16T05:38:39Z) - Ensembling over Classifiers: a Bias-Variance Perspective [13.006468721874372]
We build upon the extension to the bias-variance decomposition by Pfau (2013) in order to gain crucial insights into the behavior of ensembles of classifiers.
We show that conditional estimates necessarily incur an irreducible error.
Empirically, standard ensembling reduces the bias, leading us to hypothesize that ensembles of classifiers may perform well in part because of this unexpected reduction.
arXiv Detail & Related papers (2022-06-21T17:46:35Z) - Efficient Continuous Control with Double Actors and Regularized Critics [7.072664211491016]
We explore the long-neglected potential of double actors for better value function estimation in the continuous setting.
We build double actors upon single critic and double critics to handle overestimation bias in DDPG and underestimation bias in TD3 respectively.
To mitigate the uncertainty of value estimate from double critics, we propose to regularize the critic networks under double actors architecture.
arXiv Detail & Related papers (2021-06-06T07:04:48Z) - Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z) - Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics [65.51757376525798]
Overestimation bias is one of the major impediments to accurate off-policy learning.
This paper investigates a novel way to alleviate the overestimation bias in a continuous control setting.
Our method, Truncated Quantile Critics (TQC), blends three ideas: distributional representation of a critic, truncation of the critics' predictions, and ensembling of multiple critics.
arXiv Detail & Related papers (2020-05-08T19:52:26Z)
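The TQC summary above names truncation of the critics' predictions as one of its three ingredients. A minimal numpy sketch of that truncation idea (the function name, quantile shapes, and drop count are illustrative assumptions, not the paper's exact procedure): pool the quantile atoms from an ensemble of distributional critics, drop the largest atoms to curb overestimation, and average the rest into a scalar target.

```python
import numpy as np

def truncated_quantile_target(quantiles_per_critic, drop_per_critic=2):
    """Pool quantile atoms from all critics, discard the top atoms,
    and average the remainder into a single value target.

    quantiles_per_critic: list of 1-D arrays, one per critic, each
    holding that critic's predicted quantile atoms for a state-action.
    drop_per_critic: number of top atoms to drop per critic; dropping
    more atoms yields a more pessimistic (lower) target.
    """
    pooled = np.sort(np.concatenate(quantiles_per_critic))
    n_drop = drop_per_critic * len(quantiles_per_critic)
    kept = pooled[: len(pooled) - n_drop]  # keep the smallest atoms
    return kept.mean()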
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.