When Should We Prefer State-to-Visual DAgger Over Visual Reinforcement Learning?
- URL: http://arxiv.org/abs/2412.13662v1
- Date: Wed, 18 Dec 2024 09:39:12 GMT
- Title: When Should We Prefer State-to-Visual DAgger Over Visual Reinforcement Learning?
- Authors: Tongzhou Mu, Zhaoyang Li, Stanisław Wiktor Strzelecki, Xiu Yuan, Yunchao Yao, Litian Liang, Hao Su,
- Abstract summary: This study conducts an empirical comparison of State-to-Visual DAgger, a two-stage framework that initially trains a state policy before adopting online imitation to learn a visual policy, and Visual RL.
Surprisingly, our findings reveal that State-to-Visual DAgger does not universally outperform Visual RL but shows significant advantages in challenging tasks, offering more consistent performance.
- Score: 19.13734020392486
- License:
- Abstract: Learning policies from high-dimensional visual inputs, such as pixels and point clouds, is crucial in various applications. Visual reinforcement learning is a promising approach that directly trains policies from visual observations, although it faces challenges in sample efficiency and computational costs. This study conducts an empirical comparison of State-to-Visual DAgger, a two-stage framework that initially trains a state policy before adopting online imitation to learn a visual policy, and Visual RL across a diverse set of tasks. We evaluate both methods across 16 tasks from three benchmarks, focusing on their asymptotic performance, sample efficiency, and computational costs. Surprisingly, our findings reveal that State-to-Visual DAgger does not universally outperform Visual RL but shows significant advantages in challenging tasks, offering more consistent performance. In contrast, its benefits in sample efficiency are less pronounced, although it often reduces the overall wall-clock time required for training. Based on our findings, we provide recommendations for practitioners and hope that our results contribute valuable perspectives for future research in visual policy learning.
Related papers
- Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks.
They often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison.
We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z) - DMC-VB: A Benchmark for Representation Learning for Control with Visual Distractors [13.700885996266457]
Learning from previously collected data via behavioral cloning or offline reinforcement learning (RL) is a powerful recipe for scaling generalist agents.
We present theDeepMind Control Visual Benchmark (DMC-VB), a dataset collected in the DeepMind Control Suite to evaluate the robustness of offline RL agents.
Accompanying our dataset, we propose three benchmarks to evaluate representation learning methods for pretraining, and carry out experiments on several recently proposed methods.
arXiv Detail & Related papers (2024-09-26T23:07:01Z) - Pretrained Visual Representations in Reinforcement Learning [0.0]
This paper compares the performance of visual reinforcement learning algorithms that train a convolutional neural network (CNN) from scratch with those that utilize pre-trained visual representations (PVRs)
We evaluate the Dormant Ratio Minimization (DRM) algorithm, a state-of-the-art visual RL method, against three PVRs: ResNet18, DINOv2, and Visual Cortex (VC)
arXiv Detail & Related papers (2024-07-24T12:53:26Z) - Bootstrapping Reinforcement Learning with Imitation for Vision-Based Agile Flight [20.92646531472541]
We propose a novel approach that combines the performance of Reinforcement Learning (RL) and the sample efficiency of Imitation Learning (IL)
Our framework contains three phases teacher policy using RL with privileged state information distilling it into a student policy via IL, and adaptive fine-tuning via RL.
Tests show our approach can not only learn in scenarios where RL from scratch fails but also outperforms existing IL methods in both robustness and performance.
arXiv Detail & Related papers (2024-03-18T19:25:57Z) - Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose a high-value data selection approach TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost.
Our approach using only about 15% data can achieve comparable average performance to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z) - Offline Visual Representation Learning for Embodied Navigation [50.442660137987275]
offline pretraining of visual representations with self-supervised learning (SSL)
Online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules.
arXiv Detail & Related papers (2022-04-27T23:22:43Z) - VL-LTR: Learning Class-wise Visual-Linguistic Representation for
Long-Tailed Visual Recognition [61.75391989107558]
We present a visual-linguistic long-tailed recognition framework, termed VL-LTR.
Our method can learn visual representation from images and corresponding linguistic representation from noisy class-level text descriptions.
Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which significantly outperforms the previous best method by over 17 points.
arXiv Detail & Related papers (2021-11-26T16:24:03Z) - Seeking Visual Discomfort: Curiosity-driven Representations for
Reinforcement Learning [12.829056201510994]
We present an approach to improve sample diversity for state representation learning.
Our proposed approach boosts the visitation of problematic states, improves the learned state representation, and outperforms the baselines for all tested environments.
arXiv Detail & Related papers (2021-10-02T11:15:04Z) - Making Curiosity Explicit in Vision-based RL [12.829056201510994]
Vision-based reinforcement learning (RL) is a promising technique to solve control tasks involving images as the main observation.
State-of-the-art RL algorithms still struggle in terms of sample efficiency.
We present an approach to improve the sample diversity.
arXiv Detail & Related papers (2021-09-28T09:50:37Z) - Heterogeneous Contrastive Learning: Encoding Spatial Information for
Compact Visual Representations [183.03278932562438]
This paper presents an effective approach that adds spatial information to the encoding stage to alleviate the learning inconsistency between the contrastive objective and strong data augmentation operations.
We show that our approach achieves higher efficiency in visual representations and thus delivers a key message to inspire the future research of self-supervised visual representation learning.
arXiv Detail & Related papers (2020-11-19T16:26:25Z) - Self-supervised Co-training for Video Representation Learning [103.69904379356413]
We investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation training.
We propose a novel self-supervised co-training scheme to improve the popular infoNCE loss.
We evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval.
arXiv Detail & Related papers (2020-10-19T17:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.