DeGuV: Depth-Guided Visual Reinforcement Learning for Generalization and Interpretability in Manipulation
- URL: http://arxiv.org/abs/2509.04970v1
- Date: Fri, 05 Sep 2025 09:52:08 GMT
- Title: DeGuV: Depth-Guided Visual Reinforcement Learning for Generalization and Interpretability in Manipulation
- Authors: Tien Pham, Xinyun Chi, Khang Nguyen, Manfred Huber, Angelo Cangelosi
- Abstract summary: This paper introduces DeGuV, an RL framework that enhances both generalization and sample efficiency. We leverage a learnable masker network that produces a mask from the depth input, preserving only critical visual information while discarding irrelevant pixels. In addition, we incorporate contrastive learning and stabilize Q-value estimation under augmentation to further enhance sample efficiency and training stability.
- Score: 3.694734526301468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) agents can learn to solve complex tasks from visual inputs, but generalizing these learned skills to new environments remains a major challenge in RL applications, especially robotics. While data augmentation can improve generalization, it often compromises sample efficiency and training stability. This paper introduces DeGuV, an RL framework that enhances both generalization and sample efficiency. Specifically, we leverage a learnable masker network that produces a mask from the depth input, preserving only critical visual information while discarding irrelevant pixels. Through this, we ensure that our RL agents focus on essential features, improving robustness under data augmentation. In addition, we incorporate contrastive learning and stabilize Q-value estimation under augmentation to further enhance sample efficiency and training stability. We evaluate our proposed method on the RL-ViGen benchmark using the Franka Emika robot and demonstrate its effectiveness in zero-shot sim-to-real transfer. Our results show that DeGuV outperforms state-of-the-art methods in both generalization and sample efficiency while also improving interpretability by highlighting the most relevant regions in the visual input.
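The depth-guided masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe the masker architecture, so the per-pixel sigmoid below, its `weight` and `bias` values, and the function names `depth_mask` and `apply_mask` are all hypothetical stand-ins for a learned network.

```python
import numpy as np

def depth_mask(depth, weight=-4.0, bias=2.0):
    """Hypothetical stand-in for a learned masker: a per-pixel sigmoid over depth.

    With a negative weight, nearer pixels (small depth) get values close to 1
    and distant pixels get values close to 0. A real masker would learn this
    mapping during training; the constants here are assumptions.
    """
    return 1.0 / (1.0 + np.exp(-(weight * depth + bias)))

def apply_mask(rgb, mask):
    """Scale every colour channel by the mask, suppressing low-mask pixels."""
    return rgb * mask[..., None]

# Toy observation: a far background with one near object in the centre.
rng = np.random.default_rng(0)
rgb = rng.random((4, 4, 3))
depth = np.full((4, 4), 2.0)   # background depth, arbitrary units
depth[1:3, 1:3] = 0.2          # near object

mask = depth_mask(depth)
masked = apply_mask(rgb, mask)
```

In this toy setup the centre pixels keep most of their colour while the background is driven toward zero, which is the kind of soft attention map that could also be visualized for interpretability.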
Related papers
- Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation [60.04281435591454]
CRDA (Curriculum Reinforcement-Learning Data Augmentation) is a novel framework guiding detectors to progressively master multi-domain forgery features.
Central to our approach is integrating reinforcement learning and causal inference.
Our method significantly improves detector generalizability, outperforming SOTA methods across multiple cross-domain datasets.
arXiv Detail & Related papers (2025-11-10T12:45:52Z)
- Demystifying Reinforcement Learning in Agentic Reasoning [90.3737088727791]
We conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning.
We highlight our key insights: (i) replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT.
Exploration-friendly techniques are crucial for agentic RL: clip higher, overlong reward shaping, and maintaining adequate policy entropy can improve training efficiency.
arXiv Detail & Related papers (2025-10-13T17:57:15Z)
- Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models [22.10168313140081]
We introduce ERL-VLM, an enhanced rating-based reinforcement learning method that learns reward functions from AI feedback.
ERL-VLM queries large vision-language models for absolute ratings of individual trajectories, enabling more expressive feedback.
We demonstrate that ERL-VLM significantly outperforms existing VLM-based reward generation methods.
arXiv Detail & Related papers (2025-06-15T12:05:08Z)
- Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals [49.17123504516502]
Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from inefficiency due to redundant exposure of identical queries under uniform data sampling.
We propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework.
By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates.
arXiv Detail & Related papers (2025-06-02T21:40:38Z)
- Diffusion Guidance Is a Controllable Policy Improvement Operator [98.11511661904618]
CFGRL is trained with the simplicity of supervised learning, yet can further improve on the policies in the data.
On offline RL tasks, we observe a reliable trend: increased guidance weighting leads to increased performance.
arXiv Detail & Related papers (2025-05-29T14:06:50Z)
- Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interactions between the agent and the environment.
We propose a new method to solve it, using unsupervised model-based RL, for pre-training the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)
- CCLF: A Contrastive-Curiosity-Driven Learning Framework for Sample-Efficient Reinforcement Learning [56.20123080771364]
We develop a model-agnostic Contrastive-Curiosity-Driven Learning Framework (CCLF) for reinforcement learning.
CCLF fully exploits sample importance and improves learning efficiency in a self-supervised manner.
We evaluate this approach on the DeepMind Control Suite, Atari, and MiniGrid benchmarks.
arXiv Detail & Related papers (2022-05-02T14:42:05Z)
- Don't Touch What Matters: Task-Aware Lipschitz Data Augmentation for Visual Reinforcement Learning [27.205521177841568]
We propose Task-aware Lipschitz Data Augmentation (TLDA) for visual Reinforcement Learning (RL).
TLDA explicitly identifies the task-correlated pixels with large Lipschitz constants, and only augments the task-irrelevant pixels.
It outperforms previous state-of-the-art methods across the 3 different visual control benchmarks.
arXiv Detail & Related papers (2022-02-21T04:22:07Z)
- Mask-based Latent Reconstruction for Reinforcement Learning [58.43247393611453]
Mask-based Latent Reconstruction (MLR) is proposed to predict the complete state representations in the latent space from the observations with spatially and temporally masked pixels.
Extensive experiments show that our MLR significantly improves the sample efficiency in deep reinforcement learning.
arXiv Detail & Related papers (2022-01-28T13:07:11Z)
- Seeking Visual Discomfort: Curiosity-driven Representations for Reinforcement Learning [12.829056201510994]
We present an approach to improve sample diversity for state representation learning.
Our proposed approach boosts the visitation of problematic states, improves the learned state representation, and outperforms the baselines for all tested environments.
arXiv Detail & Related papers (2021-10-02T11:15:04Z)
- Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation [25.493902939111265]
We investigate causes of instability when using data augmentation in off-policy Reinforcement Learning algorithms.
We propose a simple yet effective technique for stabilizing this class of algorithms under augmentation.
Our method greatly improves stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL.
arXiv Detail & Related papers (2021-07-01T17:58:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.