Distilling Internet-Scale Vision-Language Models into Embodied Agents
- URL: http://arxiv.org/abs/2301.12507v2
- Date: Wed, 14 Jun 2023 14:04:50 GMT
- Title: Distilling Internet-Scale Vision-Language Models into Embodied Agents
- Authors: Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, Ishita
Dasgupta
- Abstract summary: We propose using pretrained vision-language models (VLMs) to supervise embodied agents.
We combine ideas from model distillation and hindsight experience replay (HER) to retroactively generate language describing the agent's behavior.
Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
- Score: 24.71298634838615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction-following agents must ground language into their observation and
action spaces. Learning to ground language is challenging, typically requiring
domain-specific engineering or large quantities of human interaction data. To
address this challenge, we propose using pretrained vision-language models
(VLMs) to supervise embodied agents. We combine ideas from model distillation
and hindsight experience replay (HER), using a VLM to retroactively generate
language describing the agent's behavior. Simple prompting allows us to control
the supervision signal, teaching an agent to interact with novel objects based
on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered
environment. Few-shot prompting lets us teach abstract category membership,
including pre-existing categories (food vs toys) and ad-hoc ones (arbitrary
preferences over objects). Our work outlines a new and effective way to use
internet-scale VLMs, repurposing the generic language grounding acquired by
such models to teach task-relevant groundings to embodied agents.
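To make the hindsight-relabeling step concrete, here is a minimal Python sketch under stated assumptions: the Trajectory container, relabel_with_vlm, the prompt template, and the stub caption_fn are illustrative placeholders, not the authors' implementation or a real VLM API. The idea it shows is the one from the abstract: a trajectory collected by the agent is retroactively relabeled with language generated by a VLM, and the relabeled pairs become supervised training data for the embodied agent.

```python
# Hypothetical sketch of VLM-based hindsight relabeling (not the authors' code).
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Trajectory:
    observations: List[bytes]   # rendered frames from the environment
    actions: List[int]          # actions the agent took
    instruction: str            # instruction the agent was originally given


# The prompt template controls *what* the VLM describes
# (e.g., object names vs. features vs. abstract categories).
PROMPT = "Describe the object the agent picked up, using its {attribute}."


def relabel_with_vlm(
    traj: Trajectory,
    caption_fn: Callable[[bytes, str], str],  # stand-in for a pretrained VLM
    attribute: str = "name",
) -> Trajectory:
    """Return a copy of the trajectory whose instruction is the VLM's hindsight caption."""
    final_frame = traj.observations[-1]
    hindsight_instruction = caption_fn(final_frame, PROMPT.format(attribute=attribute))
    return Trajectory(traj.observations, traj.actions, hindsight_instruction)


def build_distillation_dataset(
    trajectories: Sequence[Trajectory],
    caption_fn: Callable[[bytes, str], str],
) -> List[Trajectory]:
    """Relabeled trajectories form the supervised dataset for the embodied agent."""
    return [relabel_with_vlm(t, caption_fn) for t in trajectories]


if __name__ == "__main__":
    # Stub VLM so the sketch runs end to end; a real system would query a pretrained model.
    fake_vlm = lambda image, prompt: "pick up the toy plane"
    demo = Trajectory(observations=[b"frame0", b"frame1"], actions=[0, 3], instruction="explore")
    dataset = build_distillation_dataset([demo], fake_vlm)
    print(dataset[0].instruction)  # -> "pick up the toy plane"
```

Swapping the prompt's attribute (for example "name" vs. "color") or adding few-shot examples to it is how, per the abstract, the supervision signal can be steered toward object names, features, or category membership without changing the agent-side training loop.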
Related papers
- PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model [4.079327215055764]
Affordance understanding, the task of identifying actionable regions on 3D objects, plays a vital role in allowing robotic systems to engage with and operate within the physical world.
Visual Language Models (VLMs) have excelled in high-level reasoning but fall short in grasping the nuanced physical properties required for effective human-robot interaction.
We introduce PAVLM, an innovative framework that utilizes the extensive multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point clouds.
arXiv Detail & Related papers (2024-10-15T12:53:42Z)
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding [67.63933036920012]
Existing methods, including proxy encoding and geometry encoding, incorporate additional syntax to encode the object's location.
This study presents ClawMachine, offering a new methodology that notates an entity directly using the visual tokens.
ClawMachine unifies visual referring and grounding into an auto-regressive format and learns with a decoder-only architecture.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- Visually Grounded Language Learning: a review of language games, datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality.
In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z)
- LanGWM: Language Grounded World Model [24.86620763902546]
We focus on learning language-grounded visual features to enhance world model learning.
Our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
arXiv Detail & Related papers (2023-11-29T12:41:55Z)
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.