Off-policy Imitation Learning from Visual Inputs
- URL: http://arxiv.org/abs/2111.04345v1
- Date: Mon, 8 Nov 2021 09:06:12 GMT
- Title: Off-policy Imitation Learning from Visual Inputs
- Authors: Zhihao Cheng, Li Shen, Dacheng Tao
- Abstract summary: We propose OPIfVI, which is composed of an off-policy learning manner, data augmentation, and encoder techniques.
We show that OPIfVI is able to achieve expert-level performance and outperform existing baselines.
- Score: 83.22342811160114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, various successful applications utilizing expert states in
imitation learning (IL) have been witnessed. However, another IL setting -- IL
from visual inputs (ILfVI), which holds greater promise for real-world
application by utilizing online visual resources, suffers from low
data-efficiency and poor performance resulting from an on-policy learning
manner and high-dimensional visual inputs. We propose OPIfVI (Off-Policy Imitation from
Visual Inputs), which is composed of an off-policy learning manner, data
augmentation, and encoder techniques, to tackle the mentioned challenges,
respectively. More specifically, to improve data-efficiency, OPIfVI conducts IL
in an off-policy manner, so that sampled data can be used multiple times. In
addition, we enhance the stability of OPIfVI with spectral normalization to
mitigate the side effects of off-policy training. The core factor contributing
to the poor performance of ILfVI, we believe, is that the agent cannot extract
meaningful features from visual inputs. Hence, OPIfVI employs data augmentation
from computer vision to help train encoders that can better extract features
from visual inputs. In addition, a specific structure of gradient
backpropagation for the encoder is designed to stabilize the encoder training.
Finally, extensive experiments on the DeepMind Control Suite demonstrate that
OPIfVI achieves expert-level performance and outperforms existing baselines,
whether visual demonstrations or visual observations are provided.
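
As a concrete illustration of the spectral-normalization point above, here is a minimal PyTorch sketch of a GAIL-style discriminator whose linear layers are wrapped in spectral normalization. The class name, layer sizes, and the (encoded state, action) input are illustrative assumptions, not the paper's exact architecture; only the use of torch.nn.utils.spectral_norm to bound each layer's Lipschitz constant corresponds to the technique named in the abstract.

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class SpectralNormDiscriminator(nn.Module):
    """Scores (encoded observation, action) pairs for adversarial imitation.

    Wrapping each linear layer in spectral_norm bounds its Lipschitz
    constant, which helps keep the adversarial reward signal stable when
    transitions are replayed many times during off-policy training.
    """

    def __init__(self, feature_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(feature_dim + action_dim, hidden_dim)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden_dim, hidden_dim)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden_dim, 1)),
        )

    def forward(self, feature: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Logit > 0 suggests "expert-like"; a GAIL-style reward such as
        # -log(1 - sigmoid(logit)) could be derived from this output.
        return self.net(torch.cat([feature, action], dim=-1))

In an adversarial imitation loop, the logit would typically be converted into a reward for the off-policy agent; the abstract does not state OPIfVI's exact reward form, so the comment above is only one common choice.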
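
The abstract also credits data augmentation and a specific gradient-backpropagation structure for the encoder, without spelling either out. The sketch below shows one common design from pixel-based off-policy agents (DrQ/SAC-AE style): random-shift augmentation of image observations, with only the critic's loss allowed to backpropagate into the encoder while the actor consumes detached features. The helper names and the exact routing are assumptions for illustration, not necessarily OPIfVI's scheme.

import torch
import torch.nn.functional as F


def random_shift(imgs: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """Pad each image with replicated borders and crop back at a random offset."""
    n, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(n):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out


def actor_features(encoder, obs: torch.Tensor) -> torch.Tensor:
    """Actor uses encoder outputs but does not update the encoder (detached)."""
    return encoder(random_shift(obs)).detach()


def critic_features(encoder, obs: torch.Tensor) -> torch.Tensor:
    """Critic loss is the only path that backpropagates into the encoder."""
    return encoder(random_shift(obs))

Detaching the actor's features keeps the policy gradient from pulling the encoder in conflicting directions, which is one plausible reading of the stabilized gradient-backpropagation structure described above.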
Related papers
- Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) have promised an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z) - Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach wherein visual prompts are memorized with the weights of the FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - Can Contrastive Learning Refine Embeddings [7.212172283470726]
SIMSKIP is a contrastive learning framework that specifically refines input embeddings for downstream tasks.
We show that SIMSKIP does not result in larger upper bounds on downstream task errors than those of the original embeddings.
arXiv Detail & Related papers (2024-04-11T01:16:33Z) - MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes the use of ensemble experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z) - PerceptionGPT: Effectively Fusing Visual Perception into LLM [31.34127196055722]
The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs).
We present a novel end-to-end framework named PerceptionGPT, which efficiently equips the VLLMs with visual perception abilities.
Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens.
arXiv Detail & Related papers (2023-11-11T16:59:20Z) - BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization [89.52943129132217]
We propose a Bottom-Up Patch Summarization approach named BUS to learn a concise summary of lengthy visual token sequences efficiently.
We incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction.
This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness.
arXiv Detail & Related papers (2023-07-17T14:08:17Z) - Learning from Visual Observation via Offline Pretrained State-to-Go Transformer [29.548242447584194]
We propose a two-stage framework for learning from visual observation.
In the first stage, we pretrain State-to-Go Transformer offline to predict and differentiate latent transitions of demonstrations.
In the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks.
arXiv Detail & Related papers (2023-06-22T13:14:59Z)