Video Language Planning
- URL: http://arxiv.org/abs/2310.10625v1
- Date: Mon, 16 Oct 2023 17:48:45 GMT
- Title: Video Language Planning
- Authors: Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian
Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum,
Leslie Kaelbling, Andy Zeng, Jonathan Tompson
- Abstract summary: Video language planning is an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models.
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
- Score: 137.06052217713054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We are interested in enabling visual planning for complex long-horizon tasks
in the space of generated videos and language, leveraging recent advances in
large generative models pretrained on Internet-scale data. To this end, we
present video language planning (VLP), an algorithm that consists of a tree
search procedure, where we train (i) vision-language models to serve as both
policies and value functions, and (ii) text-to-video models as dynamics models.
VLP takes as input a long-horizon task instruction and current image
observation, and outputs a long video plan that provides detailed multimodal
(video and language) specifications that describe how to complete the final
task. VLP scales with increasing computation budget where more computation time
results in improved video plans, and is able to synthesize long-horizon video
plans across different robotics domains: from multi-object rearrangement, to
multi-camera bi-arm dexterous manipulation. Generated video plans can be
translated into real robot actions via goal-conditioned policies, conditioned
on each intermediate frame of the generated video. Experiments show that VLP
substantially improves long-horizon task success rates compared to prior
methods on both simulated and real robots (across 3 hardware platforms).
Related papers
- PiTe: Pixel-Temporal Alignment for Large Video-Language Model [40.76776645042339]
Recent Large Video-Language Models (LVidLMs) align feature of static visual data like image into latent space of language feature.
We propose a novel LVidLM by trajectory-guided Pixel-Temporal Alignment, dubbed PiTe, that exhibits promising applicable model property.
arXiv Detail & Related papers (2024-09-11T12:53:07Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - VideoPoet: A Large Language Model for Zero-Shot Video Generation [78.57171527944774]
VideoPoet is a language model capable of synthesizing high-quality video with matching audio.
VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs.
arXiv Detail & Related papers (2023-12-21T18:46:41Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature generativearity and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - Pretrained Language Models as Visual Planners for Human Assistance [12.8775186900555]
Visual Planning for Assistance (VPA) is a tool for guiding users to achieve complex multi-step goals.
We decompose VPA into video action segmentation and forecasting.
This novel approach, which we call Visual Language Model based Planner (VLaMP), outperforms baselines across a suite of metrics.
arXiv Detail & Related papers (2023-04-17T18:07:36Z) - STOA-VLP: Spatial-Temporal Modeling of Object and Action for
Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z) - Learning Universal Policies via Text-Guided Video Generation [179.6347119101618]
A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks.
Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images.
We investigate whether such tools can be used to construct more general-purpose agents.
arXiv Detail & Related papers (2023-01-31T21:28:13Z) - HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA.
arXiv Detail & Related papers (2023-01-05T21:53:19Z) - Long-Form Video-Language Pre-Training with Multimodal Temporal
Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieve new state-of-the-art performances.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.