Distilling Internet-Scale Vision-Language Models into Embodied Agents
- URL: http://arxiv.org/abs/2301.12507v2
- Date: Wed, 14 Jun 2023 14:04:50 GMT
- Title: Distilling Internet-Scale Vision-Language Models into Embodied Agents
- Authors: Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, Ishita Dasgupta
- Abstract summary: We propose using pretrained vision-language models (VLMs) to supervise embodied agents.
We combine ideas from model distillation and hindsight experience replay (HER) to retroactively generate language describing the agent's behavior.
Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
- Score: 24.71298634838615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction-following agents must ground language into their observation and
action spaces. Learning to ground language is challenging, typically requiring
domain-specific engineering or large quantities of human interaction data. To
address this challenge, we propose using pretrained vision-language models
(VLMs) to supervise embodied agents. We combine ideas from model distillation
and hindsight experience replay (HER), using a VLM to retroactively generate
language describing the agent's behavior. Simple prompting allows us to control
the supervision signal, teaching an agent to interact with novel objects based
on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered
environment. Few-shot prompting lets us teach abstract category membership,
including pre-existing categories (food vs toys) and ad-hoc ones (arbitrary
preferences over objects). Our work outlines a new and effective way to use
internet-scale VLMs, repurposing the generic language grounding acquired by
such models to teach task-relevant groundings to embodied agents.
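
The relabeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` class, the `hindsight_relabel` function, and the `toy_vlm` stand-in are all hypothetical names, and a real system would call a pretrained VLM on image observations rather than look up fields in a dictionary. The sketch only shows how a prompt can steer which description (object name or color) becomes the agent's retroactive instruction.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Frame = Dict[str, str]  # hypothetical stand-in for an image observation


@dataclass
class Trajectory:
    instruction: str      # instruction attached to the episode
    frames: List[Frame]   # observations collected during the episode
    actions: List[int]    # actions the agent took


def hindsight_relabel(
    trajectories: List[Trajectory],
    describe: Callable[[Frame, str], str],
    prompt: str,
) -> List[Trajectory]:
    """Replace each instruction with a description of what the agent actually did.

    `describe` plays the role of the VLM: given the final observation and a
    prompt, it returns language describing the outcome (HER-style relabeling).
    """
    relabeled = []
    for traj in trajectories:
        caption = describe(traj.frames[-1], prompt)
        relabeled.append(
            Trajectory(instruction=caption, frames=traj.frames, actions=traj.actions)
        )
    return relabeled


def toy_vlm(frame: Frame, prompt: str) -> str:
    """Assumed toy captioner: answers with the color or the name, per the prompt."""
    key = "color" if "color" in prompt.lower() else "name"
    return frame[key]


episodes = [
    Trajectory("lift something", [{"name": "plane", "color": "red"}], [0, 1]),
    Trajectory("lift something", [{"name": "car", "color": "blue"}], [1, 1]),
]

by_name = hindsight_relabel(episodes, toy_vlm, "What is this object?")
by_color = hindsight_relabel(episodes, toy_vlm, "What color is this object?")
print([t.instruction for t in by_name])
print([t.instruction for t in by_color])
```

The relabeled trajectories could then serve as supervised (instruction, behavior) pairs for training an instruction-following policy; changing only the prompt changes which grounding the agent is taught.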
Related papers
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLM and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLM to scale up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding [67.63933036920012]
Existing methods, including proxy encoding and geometry encoding, incorporate additional syntax to encode the object's location.
This study presents ClawMachine, offering a new methodology that notates an entity directly using the visual tokens.
ClawMachine unifies visual referring and grounding into an auto-regressive format and learns with a decoder-only architecture.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- Visually Grounded Language Learning: a review of language games, datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality.
In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z)
- LanGWM: Language Grounded World Model [24.86620763902546]
We focus on learning language-grounded visual features to enhance the world model learning.
Our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
arXiv Detail & Related papers (2023-11-29T12:41:55Z)
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.