Off-policy Imitation Learning from Visual Inputs
- URL: http://arxiv.org/abs/2111.04345v1
- Date: Mon, 8 Nov 2021 09:06:12 GMT
- Title: Off-policy Imitation Learning from Visual Inputs
- Authors: Zhihao Cheng, Li Shen, Dacheng Tao
- Abstract summary: We propose OPIfVI, which is composed of an off-policy learning manner, data augmentation, and encoder techniques.
We show that OPIfVI is able to achieve expert-level performance and outperform existing baselines.
- Score: 83.22342811160114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, various successful applications utilizing expert states in
imitation learning (IL) have been witnessed. However, another IL setting -- IL
from visual inputs (ILfVI), which holds greater promise for real-world application
by utilizing online visual resources, suffers from low data-efficiency and poor
performance resulting from an on-policy learning manner and
high-dimensional visual inputs. We propose OPIfVI (Off-Policy Imitation from
Visual Inputs), which is composed of an off-policy learning manner, data
augmentation, and encoder techniques, to tackle these challenges,
respectively. More specifically, to improve data-efficiency, OPIfVI conducts IL
in an off-policy manner, with which sampled data can be used multiple times. In
addition, we enhance the stability of OPIfVI with spectral normalization to
mitigate the side effects of off-policy training. We believe that the core
factor contributing to the poor performance of ILfVI is that the agent cannot
extract meaningful features from visual inputs. Hence, OPIfVI employs data augmentation
from computer vision to help train encoders that can better extract features
from visual inputs. In addition, a specific structure of gradient
backpropagation for the encoder is designed to stabilize the encoder training.
Finally, extensive experiments on the DeepMind Control Suite demonstrate that
OPIfVI achieves expert-level performance and outperforms existing baselines,
whether visual demonstrations or visual observations are provided.
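The abstract names four concrete ingredients: off-policy data reuse through a replay buffer, spectral normalization for training stability, image augmentation to help the encoder, and a dedicated gradient-backpropagation structure for the encoder. The sketch below shows how these pieces typically fit together in a pixel-based off-policy agent. It is a minimal illustration under the assumption of a DrQ/SAC-style setup (random-shift augmentation, encoder updated only through the critic, actor fed detached features); it is not the authors' implementation, and all sizes, names, and losses are placeholders.

```python
# Minimal sketch (PyTorch) of the components listed in the abstract, under the
# assumption of a DrQ/SAC-style pixel agent: this is NOT the authors' code, and
# the exact OPIfVI objectives (e.g., the imitation reward) are omitted.
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


def random_shift(imgs, pad=4):
    """Computer-vision data augmentation: pad a batch of images and randomly crop back."""
    n, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    top = torch.randint(0, 2 * pad + 1, (1,)).item()
    left = torch.randint(0, 2 * pad + 1, (1,)).item()
    return padded[:, :, top:top + h, left:left + w]


class Encoder(nn.Module):
    """Convolutional encoder mapping high-dimensional visual inputs to compact features."""
    def __init__(self, in_channels=3, feature_dim=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        self.proj = nn.LazyLinear(feature_dim)  # infers the flattened conv size on first use

    def forward(self, obs):
        return self.proj(self.conv(obs).flatten(start_dim=1))


class Critic(nn.Module):
    """Q-network with spectral normalization to mitigate off-policy instability."""
    def __init__(self, feature_dim=50, action_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(feature_dim + action_dim, hidden)), nn.ReLU(),
            nn.utils.spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat, action):
        return self.net(torch.cat([feat, action], dim=-1))


encoder, critic = Encoder(), Critic()
actor = nn.Sequential(nn.Linear(50, 256), nn.ReLU(), nn.Linear(256, 6), nn.Tanh())

# Off-policy manner: transitions are stored and reused many times instead of being
# discarded after a single on-policy update.
replay_buffer = deque(maxlen=100_000)


def update(obs, action, target_q):
    """One gradient step; target_q would come from Bellman targets built on imitation rewards."""
    feat = encoder(random_shift(obs))                            # augment before encoding
    critic_loss = F.mse_loss(critic(feat, action), target_q)    # critic gradient reaches the encoder
    actor_loss = -critic(feat.detach(), actor(feat.detach())).mean()  # actor gradient is cut off from the encoder
    return critic_loss, actor_loss
```

Under this routing, the augmented and encoded observation is shared by the actor and the critic, but only the critic's loss updates the encoder weights; whether OPIfVI's "specific structure of gradient backpropagation" is exactly this routing is an assumption made here for illustration.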
Related papers
- Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent [72.1517476116743]
Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets.
Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this visual forgetting.
We introduce a novel perspective that leverages effective rank to quantify the degradation of visual representations.
We propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations.
arXiv Detail & Related papers (2025-02-17T12:26:34Z)
- Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models [127.38740043393527]
We propose ViFT, a visual instruction-free fine-tuning framework for LVLMs.
We only require the text-only instructions and image caption data during training, to separately learn the task-solving and visual perception abilities.
Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks.
arXiv Detail & Related papers (2025-02-17T04:38:12Z)
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts.
We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones.
We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z)
- Efficient Reinforcement Learning Through Adaptively Pretrained Visual Encoder [12.310140622800372]
We propose APE: efficient reinforcement learning through an Adaptively Pretrained visual Encoder.
APE uses an adaptive augmentation strategy during the pretraining phase and extracts generalizable features with only a few interactions within the task environments during the policy learning period.
Results show that mainstream RL methods, such as DreamerV3 and DrQ-v2, achieve state-of-the-art performance when equipped with APE.
arXiv Detail & Related papers (2025-02-08T12:57:02Z)
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes the use of ensemble experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- PerceptionGPT: Effectively Fusing Visual Perception into LLM [31.34127196055722]
The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs).
We present a novel end-to-end framework named PerceptionGPT, which efficiently equips the VLLMs with visual perception abilities.
Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens.
arXiv Detail & Related papers (2023-11-11T16:59:20Z)
- Learning from Visual Observation via Offline Pretrained State-to-Go Transformer [29.548242447584194]
We propose a two-stage framework for learning from visual observation.
In the first stage, we pretrain State-to-Go Transformer offline to predict and differentiate latent transitions of demonstrations.
In the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks.
arXiv Detail & Related papers (2023-06-22T13:14:59Z)
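As a rough illustration of the two-stage idea in the entry above (the generic pattern only, not the State-to-Go Transformer's actual architecture or objectives), one could pretrain a discriminator on expert latent transitions offline and then use its frozen score on the agent's transitions as an intrinsic reward during downstream RL. Every module and loss below is a hypothetical placeholder.

```python
# Hypothetical sketch of a two-stage "pretrain on demonstrations, then reward the agent"
# scheme; the real State-to-Go Transformer has its own architecture and objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1: pretrain a discriminator offline on latent transitions (z_t, z_{t+1})
# from expert demonstrations versus negatives (e.g., shuffled or predicted transitions).
disc = nn.Sequential(nn.Linear(2 * 50, 256), nn.ReLU(), nn.Linear(256, 1))

def pretrain_step(expert_z, expert_z_next, negative_z_next):
    """Binary classification: expert transitions vs. non-expert transitions."""
    pos = disc(torch.cat([expert_z, expert_z_next], dim=-1))
    neg = disc(torch.cat([expert_z, negative_z_next], dim=-1))
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))

# Stage 2: during downstream RL, score the agent's latent transitions with the frozen
# discriminator and use the result as an intrinsic reward.
@torch.no_grad()
def intrinsic_reward(z, z_next):
    return torch.sigmoid(disc(torch.cat([z, z_next], dim=-1))).squeeze(-1)
```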