See, Plan, Predict: Language-guided Cognitive Planning with Video
Prediction
- URL: http://arxiv.org/abs/2210.03825v1
- Date: Fri, 7 Oct 2022 21:27:16 GMT
- Title: See, Plan, Predict: Language-guided Cognitive Planning with Video
Prediction
- Authors: Maria Attarian, Advaya Gupta, Ziyi Zhou, Wei Yu, Igor Gilitschenski,
Animesh Garg
- Abstract summary: We devise a cognitive planning algorithm via language-guided video prediction.
The network is endowed with the ability to ground concepts based on natural language input with generalization to unseen objects.
- Score: 27.44435424335596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cognitive planning is the structural decomposition of complex tasks into a
sequence of future behaviors. In the computational setting, performing
cognitive planning entails grounding plans and concepts in one or more
modalities in order to leverage them for low level control. Since real-world
tasks are often described in natural language, we devise a cognitive planning
algorithm via language-guided video prediction. Current video prediction models
do not support conditioning on natural language instructions. Therefore, we
propose a new video prediction architecture which leverages the power of
pre-trained transformers. The network is endowed with the ability to ground
concepts based on natural language input, with generalization to unseen objects.
We demonstrate the effectiveness of this approach on a new simulation dataset,
where each task is defined by a high-level action described in natural
language. Our experiments compare our method against one video generation
baseline without planning or action grounding and showcase significant
improvements. Our ablation studies highlight the improved generalization to
unseen objects that natural language embeddings lend to concept grounding,
as well as the importance of planning towards the visual "imagination" of
a task.
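Since the abstract describes the architecture only at a high level, the following is a minimal, hypothetical sketch of language-conditioned video prediction with a frozen pre-trained text transformer. The BERT text encoder, the FiLM-style conditioning, and all layer sizes are assumptions made for illustration, not the paper's actual design.
```python
# Hypothetical sketch, NOT the authors' released code: a frozen pre-trained
# text encoder supplies a language embedding that conditions a simple
# frame-prediction network.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel  # any pre-trained text transformer

class LanguageConditionedPredictor(nn.Module):
    def __init__(self, text_model_name="bert-base-uncased", frame_channels=3, hidden=256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(text_model_name)
        self.text_encoder = AutoModel.from_pretrained(text_model_name)
        for p in self.text_encoder.parameters():  # keep the language model frozen
            p.requires_grad = False
        text_dim = self.text_encoder.config.hidden_size
        # Frame encoder/decoder: placeholder CNN, chosen only for illustration
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(frame_channels, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.film = nn.Linear(text_dim, 2 * hidden)  # FiLM-style conditioning (assumption)
        self.frame_decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, frame_channels, 4, stride=2, padding=1),
        )

    def forward(self, frame, instruction):
        # frame: (B, C, H, W) current observation; instruction: list of strings
        tokens = self.tokenizer(instruction, return_tensors="pt", padding=True)
        text = self.text_encoder(**tokens).last_hidden_state[:, 0]  # [CLS] embedding
        feat = self.frame_encoder(frame)
        gamma, beta = self.film(text).chunk(2, dim=-1)
        feat = feat * gamma[:, :, None, None] + beta[:, :, None, None]
        return self.frame_decoder(feat)  # predicted next frame

# Example call (hypothetical):
# next_frame = LanguageConditionedPredictor()(torch.randn(1, 3, 64, 64),
#                                             ["push the red block to the left"])
```
The only point of the sketch is that the instruction embedding modulates the visual features before the next frame is decoded; the paper's concrete architecture and planning procedure differ.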
Related papers
- Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models [85.55649666025926]
We introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities.
Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans.
We propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans.
arXiv Detail & Related papers (2024-09-22T00:30:11Z)
- How language models extrapolate outside the training data: A case study in Textualized Gridworld [32.5268320198854]
We show that conventional approaches, including next-token prediction and Chain of Thought fine-tuning, fail to generalize in larger, unseen environments.
Inspired by human cognition and dual-process theory, we propose language models should construct cognitive maps before interaction.
arXiv Detail & Related papers (2024-06-21T16:10:05Z)
- LanGWM: Language Grounded World Model [24.86620763902546]
We focus on learning language-grounded visual features to enhance the world model learning.
Our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
arXiv Detail & Related papers (2023-11-29T12:41:55Z)
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- Learning Universal Policies via Text-Guided Video Generation [179.6347119101618]
A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks.
Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images.
We investigate whether such tools can be used to construct more general-purpose agents.
arXiv Detail & Related papers (2023-01-31T21:28:13Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
- Integrating AI Planning with Natural Language Processing: A Combination of Explicit and Tacit Knowledge [15.488154564562185]
This paper outlines the commonalities and relations between AI planning and natural language processing.
It argues that each can effectively impact the other in five areas: (1) planning-based text understanding, (2) planning-based natural language processing, (3) planning-based explainability, (4) text-based human-robot interaction, and (5) applications.
arXiv Detail & Related papers (2022-02-15T02:19:09Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings (a generic sketch of this idea appears at the end of this page).
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
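As an aside to the "Pre-Trained Language Models for Interactive Decision-Making" entry above, here is a generic sketch of representing goals and observations as a single embedding sequence consumed by a policy whose backbone is initialized from a pre-trained language model. GPT-2, the projection layers, and all dimensions are assumptions for illustration, not that paper's implementation.
```python
# Generic illustration (not the cited paper's code): goal tokens and observation
# features are each mapped into the language model's embedding space,
# concatenated into one sequence, and decoded into action logits.
import torch
import torch.nn as nn
from transformers import GPT2Model

class EmbeddingSequencePolicy(nn.Module):
    def __init__(self, obs_dim=128, goal_vocab=1000, n_actions=10):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")  # pre-trained LM initialization
        d = self.backbone.config.n_embd
        self.obs_proj = nn.Linear(obs_dim, d)          # observation features -> LM space
        self.goal_embed = nn.Embedding(goal_vocab, d)  # goal tokens -> LM space
        self.action_head = nn.Linear(d, n_actions)

    def forward(self, goal_tokens, obs_features):
        # goal_tokens: (B, Tg) token ids; obs_features: (B, To, obs_dim)
        seq = torch.cat([self.goal_embed(goal_tokens),
                         self.obs_proj(obs_features)], dim=1)
        hidden = self.backbone(inputs_embeds=seq).last_hidden_state
        return self.action_head(hidden[:, -1])  # action logits from the final position

# Usage sketch:
# policy = EmbeddingSequencePolicy()
# logits = policy(torch.randint(0, 1000, (1, 8)), torch.randn(1, 5, 128))
```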