Off-Policy Actor-Critic with Sigmoid-Bounded Entropy for Real-World Robot Learning
- URL: http://arxiv.org/abs/2601.15761v1
- Date: Thu, 22 Jan 2026 08:51:16 GMT
- Title: Off-Policy Actor-Critic with Sigmoid-Bounded Entropy for Real-World Robot Learning
- Authors: Xiefeng Wu, Mingyu Hu, Shu Zhang
- Abstract summary: We introduce SigEnt-SAC, an off-policy actor-critic method that learns from scratch using a single expert trajectory. SigEnt-SAC substantially alleviates Q-function oscillations and reaches a 100% success rate faster than prior methods.
- Score: 1.6836220990645554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying reinforcement learning in the real world remains challenging due to sample inefficiency, sparse rewards, and noisy visual observations. Prior work leverages demonstrations and human feedback to improve learning efficiency and robustness. However, offline-to-online methods need large datasets and can be unstable, while VLA-assisted RL relies on large-scale pretraining and fine-tuning. As a result, a low-cost real-world RL method with minimal data requirements has yet to emerge. We introduce SigEnt-SAC, an off-policy actor-critic method that learns from scratch using a single expert trajectory. Our key design is a sigmoid-bounded entropy term that prevents negative-entropy-driven optimization toward out-of-distribution actions and reduces Q-function oscillations. We benchmark SigEnt-SAC on D4RL tasks against representative baselines. Experiments show that SigEnt-SAC substantially alleviates Q-function oscillations and reaches a 100% success rate faster than prior methods. Finally, we validate SigEnt-SAC on four real-world robotic tasks across multiple embodiments, where agents learn from raw images and sparse rewards; results demonstrate that SigEnt-SAC can learn successful policies with only a small number of real-world interactions, suggesting a low-cost and practical pathway for real-world RL deployment.
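The paper's key design is the sigmoid-bounded entropy term. The abstract does not give the exact formulation, but a minimal sketch of the idea, squashing the SAC entropy bonus through a sigmoid so it stays bounded, could look like the following; the `policy`/`q_net` interfaces and the `alpha`/`scale` hyperparameters are assumptions, not the paper's definitions:

```python
import torch
from torch.distributions import Normal

def sigent_actor_loss(obs, policy, q_net, alpha=0.2, scale=1.0):
    """SAC-style actor loss with a sigmoid-bounded entropy bonus (sketch).

    `policy` maps obs -> (mean, log_std); `q_net` maps (obs, action) -> Q.
    Tanh squashing and its log-prob correction are omitted for brevity;
    `alpha` and `scale` are assumed hyperparameters, not the paper's names.
    """
    mean, log_std = policy(obs)
    dist = Normal(mean, log_std.exp())
    action = dist.rsample()                      # reparameterized sample
    log_prob = dist.log_prob(action).sum(-1)     # log pi(a|s)
    # Plain SAC uses -log_prob directly, which is unbounded below and can
    # push the policy toward out-of-distribution actions; squashing it
    # through a sigmoid keeps the bonus in (0, 1).
    bounded_entropy = torch.sigmoid(-scale * log_prob)
    q_value = q_net(obs, action).squeeze(-1)
    return (-q_value - alpha * bounded_entropy).mean()
```

Because the bonus lies in (0, 1), driving log-probabilities arbitrarily negative no longer yields unbounded reward, which matches the abstract's claim that the term prevents negative-entropy-driven drift toward out-of-distribution actions.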
Related papers
- Scaling Agent Learning via Experience Synthesis [100.42712232390532]
Reinforcement learning can empower autonomous agents by enabling self-improvement through interaction. But its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity. We introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind.
arXiv Detail & Related papers (2025-11-05T18:58:48Z)
- Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings
Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent.
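One plausible instantiation of this initialization, sketched below under assumed interfaces (`value_net` maps states to scalar values; `demo_trajs` holds (states, rewards) pairs), regresses a value network onto discounted Monte-Carlo returns computed along the demonstrations before online learning begins:

```python
import numpy as np
import torch
import torch.nn.functional as F

def pretrain_value_from_demos(value_net, demo_trajs, gamma=0.99,
                              epochs=50, lr=1e-3):
    """Fit V(s) to Monte-Carlo returns along a few successful demos (sketch).

    `demo_trajs` is a list of (states, rewards) tuples, one per trajectory.
    This seeds optimistic values along demonstrated paths before online
    RL begins; the paper's exact initialization may differ.
    """
    states, returns = [], []
    for obs_seq, rewards in demo_trajs:
        g, traj_returns = 0.0, []
        for r in reversed(rewards):              # discounted return-to-go
            g = r + gamma * g
            traj_returns.append(g)
        returns.extend(reversed(traj_returns))
        states.extend(obs_seq)
    states_t = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    returns_t = torch.as_tensor(returns, dtype=torch.float32)
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.mse_loss(value_net(states_t).squeeze(-1), returns_t)
        opt.zero_grad()
        loss.backward()
        opt.step()
```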
arXiv Detail & Related papers (2025-10-28T14:01:13Z)
- Residual Off-Policy RL for Finetuning Behavior Cloning Policies [41.99435186991878]
We present a recipe that combines the benefits of behavior cloning (BC) and reinforcement learning (RL) through a residual learning framework. Our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands.
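A common shape for such a residual framework, sketched here with assumed names rather than the paper's exact architecture, freezes the BC policy and trains only a small additive correction with off-policy RL:

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Frozen behavior-cloned base policy plus a learned residual (sketch).

    Only the residual head is trained with off-policy RL; `residual_scale`
    bounds the correction so early training stays close to the BC policy.
    The names and scaling scheme are assumptions, not the paper's recipe.
    """
    def __init__(self, bc_policy, obs_dim, act_dim, residual_scale=0.1):
        super().__init__()
        self.bc_policy = bc_policy
        for p in self.bc_policy.parameters():
            p.requires_grad_(False)              # keep the BC base frozen
        self.residual = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh())
        self.residual_scale = residual_scale

    def forward(self, obs):
        base = self.bc_policy(obs)               # nominal BC action
        delta = self.residual(torch.cat([obs, base], dim=-1))
        return base + self.residual_scale * delta
```

Keeping the residual small at the start means early rollouts behave like the BC policy, which is one reason residual schemes tend to be safe to train on real hardware.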
arXiv Detail & Related papers (2025-09-23T17:59:46Z)
- Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
- SLAC: Simulation-Pretrained Latent Action Space for Whole-Body Real-World RL [41.254970515368335]
Building capable household and industrial robots requires mastering the control of versatile, high-degree-of-freedom (DoF) systems such as mobile manipulators. While reinforcement learning holds promise for autonomously acquiring robot control policies, scaling it to high-DoF embodiments remains challenging. This paper introduces SLAC, a method that renders real-world RL feasible for complex embodiments.
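The title points to the mechanism: actions are pretrained in simulation into a low-dimensional latent space, and real-world RL then explores only that space. A hedged sketch of the environment wrapper, assuming an old-style Gym `step`/`reset` API and a `decoder(obs, z) -> joint commands` interface:

```python
import torch

class LatentActionWrapper:
    """Expose a sim-pretrained latent action space to a real-world RL agent (sketch).

    `decoder` (pretrained in simulation) maps the current observation plus a
    low-dimensional latent action to high-DoF joint commands, so the online
    agent explores only the latent space. The old-style Gym `step`/`reset`
    API and the decoder interface are assumptions.
    """
    def __init__(self, env, decoder):
        self.env, self.decoder = env, decoder
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, latent_action):
        obs = torch.as_tensor(self._last_obs, dtype=torch.float32)
        z = torch.as_tensor(latent_action, dtype=torch.float32)
        with torch.no_grad():
            joint_cmd = self.decoder(torch.cat([obs, z], dim=-1)).numpy()
        self._last_obs, reward, done, info = self.env.step(joint_cmd)
        return self._last_obs, reward, done, info
```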
arXiv Detail & Related papers (2025-06-04T16:41:55Z)
- REBOOT: Reuse Data for Bootstrapping Efficient Real-World Dexterous Manipulation [61.7171775202833]
We introduce an efficient system for learning dexterous manipulation skills with reinforcement learning.
The main idea of our approach is the integration of recent advances in sample-efficient RL and replay buffer bootstrapping.
Our system completes the real-world training cycle by incorporating learned resets via an imitation-based pickup policy.
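One simple reading of replay-buffer bootstrapping, sketched below (the fixed `prior_fraction` mixing rule is an assumption, not necessarily the paper's scheme), seeds the buffer with transitions from prior runs and mixes them into every batch:

```python
import random
from collections import deque

class BootstrappedReplayBuffer:
    """Replay buffer seeded with transitions from prior runs (sketch).

    Mixing a fixed fraction of prior data into each batch is one simple
    reading of replay-buffer bootstrapping; the paper's sampling scheme
    may differ.
    """
    def __init__(self, capacity=100_000, prior_fraction=0.5):
        self.online = deque(maxlen=capacity)     # current task's transitions
        self.prior = []                          # reused data from earlier runs
        self.prior_fraction = prior_fraction

    def seed_with_prior_data(self, transitions):
        self.prior.extend(transitions)

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size):
        n_prior = min(int(batch_size * self.prior_fraction), len(self.prior))
        batch = random.sample(self.prior, n_prior) if n_prior else []
        n_online = min(batch_size - n_prior, len(self.online))
        batch += random.sample(list(self.online), n_online)
        return batch
```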
arXiv Detail & Related papers (2023-09-06T19:05:31Z)
- RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation [74.47709320443998]
We propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation.
RLSAC employs a graph neural network that uses both data and memory features to guide the exploration direction when sampling the next minimum set.
Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis.
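As a toy illustration of the loop, the sketch below does robust line fitting where a learned scorer (standing in for RLSAC's graph neural network) biases which points form the next minimum set; the `score_fn` interface and threshold are assumptions:

```python
import numpy as np

def learned_ransac(points, score_fn, n_iters=100, thresh=0.05, seed=0):
    """RANSAC-style robust line fitting with learned sampling weights (sketch).

    `score_fn(points, residuals)` stands in for RLSAC's graph neural network:
    it returns per-point weights that bias sampling of the next minimum set.
    `points` is an (N, 2) array; thresholds and interfaces are assumptions.
    """
    rng = np.random.default_rng(seed)
    residuals = np.zeros(len(points))
    best_inliers, best_model = -1, None
    for _ in range(n_iters):
        w = score_fn(points, residuals)
        w = w / w.sum()
        idx = rng.choice(len(points), size=2, replace=False, p=w)
        (x1, y1), (x2, y2) = points[idx]
        a, b, c = y2 - y1, x1 - x2, x2 * y1 - x1 * y2   # line through the pair
        residuals = np.abs(a * points[:, 0] + b * points[:, 1] + c) \
            / (np.hypot(a, b) + 1e-12)
        inliers = int((residuals < thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (a, b, c)
    return best_model, best_inliers
```

A trivial scorer such as `lambda pts, res: np.exp(-res / 0.1)` already biases sampling toward points consistent with the previous hypothesis; RLSAC learns this mapping from data and memory features instead.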
arXiv Detail & Related papers (2023-08-10T03:14:19Z)
- Maximum Entropy Heterogeneous-Agent Reinforcement Learning [45.377385280485065]
Multi-agent reinforcement learning (MARL) has been shown effective for cooperative games in recent years. We propose a unified framework for learning policies that resolves issues related to sample complexity, training instability, and the risk of converging to a suboptimal Nash equilibrium. Based on the MaxEnt framework, we propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game.
arXiv Detail & Related papers (2023-06-19T06:22:02Z)
- Confidence-Controlled Exploration: Efficient Sparse-Reward Policy Learning for Robot Navigation [72.24964965882783]
Reinforcement learning (RL) is a promising approach for robotic navigation, allowing robots to learn through trial and error. Real-world robotic tasks often suffer from sparse rewards, leading to inefficient exploration and suboptimal policies. We introduce Confidence-Controlled Exploration (CCE), a novel method that improves sample efficiency in RL-based robotic navigation without modifying the reward function.
arXiv Detail & Related papers (2023-06-09T18:45:15Z)
- Learning Dexterous Manipulation from Suboptimal Experts [69.8017067648129]
Relative Entropy Q-Learning (REQ) is a simple policy iteration algorithm that combines ideas from successful offline and conventional RL algorithms.
We show how REQ is also effective for general off-policy RL, offline RL, and RL from demonstrations.
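The summary does not spell out the update, but relative-entropy methods are commonly instantiated by fitting the parametric policy to a non-parametric target proportional to exp(advantage / temperature). A minimal advantage-weighted sketch, with assumed names and normalization:

```python
import torch

def relative_entropy_policy_loss(log_prob, q_values, values, temperature=1.0):
    """KL-regularized policy improvement in the spirit of REQ (sketch).

    The parametric policy is fit to a non-parametric target proportional to
    exp(advantage / temperature); this advantage-weighted likelihood is one
    common instantiation, and the batch softmax normalization is an assumption.
    """
    advantages = q_values - values               # A(s, a) = Q(s, a) - V(s)
    weights = torch.softmax(advantages / temperature, dim=0)
    return -(weights.detach() * log_prob).sum()  # weighted max-likelihood
```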
arXiv Detail & Related papers (2020-10-16T18:48:49Z)
- Band-limited Soft Actor Critic Model [15.11069042369131]
Soft Actor Critic (SAC) algorithms show remarkable performance in complex simulated environments.
We take this idea one step further by artificially band-limiting the spatial resolution of the target critic.
We derive the closed form solution in the linear case and show that bandlimiting reduces the interdependency between the low frequency components of the state-action value approximation.
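The paper's band-limiting acts on the critic itself; as a purely illustrative stand-in, one can smooth the target critic by averaging it over small action perturbations, which attenuates high-frequency structure in Q(s, a):

```python
import torch

def bandlimited_target_q(target_q_net, obs, actions, sigma=0.05, n_samples=8):
    """Illustrative low-pass filtering of the target critic (sketch).

    Averaging target Q over small Gaussian action perturbations attenuates
    high-frequency components of Q(s, a). The paper band-limits the critic
    itself; this stand-in only conveys the smoothing intuition.
    """
    with torch.no_grad():
        noise = sigma * torch.randn(n_samples, *actions.shape)
        perturbed = actions.unsqueeze(0) + noise          # (n, B, act_dim)
        obs_rep = obs.unsqueeze(0).expand(n_samples, *obs.shape)
        q = target_q_net(obs_rep.reshape(-1, obs.shape[-1]),
                         perturbed.reshape(-1, actions.shape[-1]))
        return q.reshape(n_samples, -1).mean(0)           # smoothed target
```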
arXiv Detail & Related papers (2020-06-19T22:52:43Z)
- Meta-Reinforcement Learning for Robotic Industrial Insertion Tasks [70.56451186797436]
We study how to use meta-reinforcement learning to solve the bulk of the problem in simulation.
We demonstrate our approach by training an agent to successfully perform challenging real-world insertion tasks.
arXiv Detail & Related papers (2020-04-29T18:00:22Z)