GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic
Manipulation
- URL: http://arxiv.org/abs/2307.05963v1
- Date: Wed, 12 Jul 2023 07:12:20 GMT
- Title: GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic
Manipulation
- Authors: Junghyun Kim, Gi-Cheon Kang, Jaein Kim, Suyeon Shin, Byoung-Tak Zhang
- Abstract summary: Grounding Vision to Ceaselessly Created Instructions (GVCCI) is a lifelong learning framework for Language-Guided Robotic Manipulation (LGRM).
GVCCI iteratively generates synthetic instructions via object detection and trains the VG model with the generated data.
Experimental results show that GVCCI leads to a steady improvement in VG by up to 56.7% and improves LGRM by up to 29.4%.
- Score: 20.041507826568093
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language-Guided Robotic Manipulation (LGRM) is a challenging task as it
requires a robot to understand human instructions to manipulate everyday
objects. Recent approaches in LGRM rely on pre-trained Visual Grounding (VG)
models to detect objects without adapting to manipulation environments. This
results in a performance drop due to a substantial domain gap between the
pre-training and real-world data. A straightforward solution is to collect
additional training data, but the cost of human annotation is extortionate. In
this paper, we propose Grounding Vision to Ceaselessly Created Instructions
(GVCCI), a lifelong learning framework for LGRM, which continuously learns VG
without human supervision. GVCCI iteratively generates synthetic instructions
via object detection and trains the VG model with the generated data. We
validate our framework in offline and online settings across diverse
environments on different VG models. Experimental results show that
accumulating synthetic data from GVCCI leads to a steady improvement in VG by
up to 56.7% and improves resultant LGRM by up to 29.4%. Furthermore, the
qualitative analysis shows that the unadapted VG model often fails to find
correct objects due to a strong bias learned from the pre-training data.
Finally, we introduce a novel VG dataset for LGRM, consisting of nearly 252k
triplets of image-object-instruction from diverse manipulation environments.
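The abstract describes an iterative self-supervision loop: detect objects in newly observed scenes, convert each detection into a synthetic instruction, accumulate the resulting image-object-instruction triplets, and retrain the VG model on the growing pool. The minimal Python sketch below illustrates one plausible form of such a loop; the names used here (Detection, TripletStore, make_instruction, train_vg) are illustrative assumptions, not the authors' actual code.

```python
# Hedged sketch (not GVCCI's implementation) of the lifelong loop described above:
# detect objects in unlabeled scene images, turn each detection into a synthetic
# instruction, accumulate (image, box, instruction) triplets, and retrain the
# visual grounding (VG) model on everything collected so far.

from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    label: str  # e.g. "red mug"
    box: Box


@dataclass
class TripletStore:
    """Accumulated synthetic training data; grows across lifelong iterations."""
    triplets: List[Tuple[Any, Box, str]] = field(default_factory=list)


def lifelong_vg_update(
    images: List[Any],
    detect: Callable[[Any], List[Detection]],          # pre-trained object detector
    make_instruction: Callable[[Detection], str],      # template- or caption-based generator
    train_vg: Callable[[List[Tuple[Any, Box, str]]], None],
    store: TripletStore,
) -> None:
    """One iteration: pseudo-label newly observed images, then retrain VG."""
    for img in images:
        for det in detect(img):
            # Synthetic instruction such as "pick up the red mug".
            instruction = make_instruction(det)
            store.triplets.append((img, det.box, instruction))
    # Retraining on the accumulated synthetic data is what would drive the
    # steady VG improvement reported in the abstract.
    train_vg(store.triplets)
```

In this sketch, each call pseudo-labels the newest images before retraining, so the synthetic dataset keeps accumulating across iterations without any human supervision.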
Related papers
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, both zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z) - KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data [45.25288643161976]
We propose Keypoint Affordance Learning from Imagined Environments (KALIE) for robotic control in a scalable manner.
Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations.
We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points.
arXiv Detail & Related papers (2024-09-21T08:45:16Z) - Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions [36.851214751652996]
We propose a new intention-driven visual grounding (IVG) task and build a large-scale IVG dataset termed IntentionVG with free-form intention expressions.
Because practical agents must move around and locate specific targets across diverse scenarios to perform grounding, the IVG task and IntentionVG dataset account for both multi-scenario perception and the egocentric view.
arXiv Detail & Related papers (2024-02-17T12:42:14Z) - Visual Geo-localization with Self-supervised Representation Learning [8.642591824865892]
We present a novel unified VG-SSL framework that aims to enhance performance and training efficiency on a large Visual Geo-localization dataset.
Our work incorporates multiple SSL methods tailored for VG: SimCLR, MoCov2, BYOL, SimSiam, Barlow Twins, and VICReg.
arXiv Detail & Related papers (2023-07-31T19:03:13Z) - Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision [24.90534567531536]
We propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS).
The proposed framework is evaluated on five regular VG datasets and two newly constructed robust VG datasets.
arXiv Detail & Related papers (2023-07-23T17:55:24Z) - Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z) - Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z) - A Framework for Efficient Robotic Manipulation [79.10407063260473]
We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels.
arXiv Detail & Related papers (2020-12-14T22:18:39Z) - Improving the Performance of Fine-Grain Image Classifiers via Generative Data Augmentation [0.5161531917413706]
We develop Data Augmentation from Proficient Pre-Training of Robust Generative Adversarial Networks (DAPPER GAN).
DAPPER GAN is an ML analytics support tool that automatically generates novel views of training images.
We experimentally evaluate this technique on the Stanford Cars dataset, demonstrating improved vehicle make and model classification accuracy.
arXiv Detail & Related papers (2020-08-12T15:29:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.