Pretrained Language Models as Visual Planners for Human Assistance
- URL: http://arxiv.org/abs/2304.09179v3
- Date: Sat, 26 Aug 2023 06:22:41 GMT
- Title: Pretrained Language Models as Visual Planners for Human Assistance
- Authors: Dhruvesh Patel, Hamid Eghbalzadeh, Nitin Kamra, Michael Louis Iuzzolino, Unnat Jain, Ruta Desai
- Abstract summary: Visual Planning for Assistance (VPA) is the task of guiding a user toward a complex multi-step goal.
We decompose VPA into video action segmentation and forecasting.
This novel approach, which we call Visual Language Model based Planner (VLaMP), outperforms baselines across a suite of metrics.
- Score: 12.8775186900555
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In our pursuit of advancing multi-modal AI assistants capable of guiding
users to achieve complex multi-step goals, we propose the task of "Visual
Planning for Assistance (VPA)". Given a succinct natural language goal, e.g.,
"make a shelf", and a video of the user's progress so far, the aim of VPA is to
devise a plan, i.e., a sequence of actions such as "sand shelf", "paint shelf",
etc., to realize the specified goal. This requires assessing the user's progress
from the (untrimmed) video and relating it to the requirements of the natural
language goal, i.e., deciding which actions to select and in what order. Consequently,
this requires handling long video history and arbitrarily complex action
dependencies. To address these challenges, we decompose VPA into video action
segmentation and forecasting. Importantly, we experiment by formulating the
forecasting step as a multi-modal sequence modeling problem, allowing us to
leverage the strength of pre-trained LMs (as the sequence model). This novel
approach, which we call Visual Language Model based Planner (VLaMP),
outperforms baselines across a suite of metrics that gauge the quality of the
generated plans. Furthermore, through comprehensive ablations, we also isolate
the value of each component: language pre-training, visual observations, and
goal information. We have open-sourced all the data, model checkpoints, and
training code.
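To make the forecasting step concrete, here is a minimal sketch of treating planning as multi-modal sequence modeling with a pretrained LM backbone: the language goal, projected visual features, and the action history recognized by the segmentation step are concatenated into one embedding sequence, and the LM predicts the next action. The GPT-2 backbone, module names, and dimensions are illustrative assumptions, not the released VLaMP code.

```python
# Hedged sketch: a pretrained LM as the sequence model for the forecasting step.
# All names and dimensions are illustrative assumptions, not the released VLaMP code.
import torch
import torch.nn as nn
from transformers import GPT2Model

class ForecastingPlanner(nn.Module):
    def __init__(self, num_actions: int, video_feat_dim: int = 512):
        super().__init__()
        self.lm = GPT2Model.from_pretrained("gpt2")            # pretrained LM backbone
        hidden = self.lm.config.n_embd
        self.video_proj = nn.Linear(video_feat_dim, hidden)    # visual features -> LM space
        self.action_embed = nn.Embedding(num_actions, hidden)  # recognized action history
        self.action_head = nn.Linear(hidden, num_actions)      # next-action logits

    def forward(self, goal_ids, video_feats, past_actions):
        # goal_ids: (B, Tg) token ids of the goal, e.g. "make a shelf"
        # video_feats: (B, Tv, D) per-segment features from the untrimmed video
        # past_actions: (B, Ta) indices of actions segmented so far
        seq = torch.cat(
            [self.lm.wte(goal_ids),              # goal tokens
             self.video_proj(video_feats),       # visual observations
             self.action_embed(past_actions)],   # action history
            dim=1,
        )
        h = self.lm(inputs_embeds=seq).last_hidden_state
        return self.action_head(h[:, -1])        # logits over the next action
```

Rolling this out autoregressively (appending each predicted action and re-running the model) would yield a plan of the desired length.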
Related papers
- Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos [48.15438373870542]
VidAssist is an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos.
It employs a breadth-first search algorithm for optimal plan generation (see the sketch after this entry).
Experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups.
arXiv Detail & Related papers (2024-09-30T17:57:28Z)
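As a rough illustration of the breadth-first plan search mentioned above, the sketch below expands partial plans level by level; `propose_steps` and `score_plan` are hypothetical stand-ins for the LLM-based proposal and assessment modules, and the pruning width is added only to keep the toy search tractable.

```python
# Hedged sketch of breadth-first plan search; propose_steps and score_plan are
# hypothetical stand-ins for LLM-based proposal and assessment.
from collections import deque

def bfs_plan(goal, propose_steps, score_plan, max_depth=4, width=3):
    best_plan, best_score = [], float("-inf")
    frontier = deque([[]])                               # start from the empty partial plan
    for _ in range(max_depth):
        scored_children = []
        while frontier:
            plan = frontier.popleft()
            for step in propose_steps(goal, plan):       # candidate next steps
                candidate = plan + [step]
                score = score_plan(goal, candidate)      # assess the partial plan
                scored_children.append((score, candidate))
                if score > best_score:
                    best_score, best_plan = score, candidate
        scored_children.sort(key=lambda sc: sc[0], reverse=True)
        frontier = deque(c for _, c in scored_children[:width])  # prune each level
    return best_plan
```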
- PARADISE: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset [0.0]
We present PARADISE, an abductive reasoning task in a Q&A format built on practical procedural text sourced from wikiHow.
It involves warning- and tip-inference tasks tied directly to goals, excluding intermediary steps, to test whether models can infer implicit knowledge of a plan solely from the given goal.
Our experiments with fine-tuned language models and zero-shot prompting reveal that task-specific small models outperform large language models in most scenarios.
arXiv Detail & Related papers (2024-03-05T18:01:59Z)
- CI w/o TN: Context Injection without Task Name for Procedure Planning [4.004155037293416]
Procedure planning in instructional videos involves creating goal-directed plans based on visual start and goal observations from videos.
Previous research has tackled this problem with progressively weaker training supervision, ranging from heavy supervision (intermediate visual observations or language instructions) down to task-class supervision.
We propose a much weaker setting in which no task name is provided as supervision, a setting that existing large language models cannot currently solve.
arXiv Detail & Related papers (2024-02-23T19:34:47Z)
- Video Language Planning [137.06052217713054]
Video Language Planning (VLP) is an algorithm built around a tree search procedure in which we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models to serve as dynamics models (a sketch follows this entry).
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
arXiv Detail & Related papers (2023-10-16T17:48:45Z)
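The following is a toy sketch of that tree-search loop; `policy`, `dynamics`, and `value` are hypothetical placeholders for the vision-language policy, text-to-video dynamics model, and value function, not the paper's released models.

```python
# Hedged sketch of a video-language tree search: propose text actions, roll the
# dynamics model forward in video space, and keep the highest-value branches.
def vlp_tree_search(goal, start_video, policy, dynamics, value, branch=3, depth=2):
    frontier = [(start_video, [])]                        # (predicted video, action sequence)
    for _ in range(depth):
        children = []
        for video, actions in frontier:
            for action in policy(goal, video, n=branch):  # candidate text actions
                next_video = dynamics(video, action)      # predicted video after the action
                children.append((next_video, actions + [action]))
        children.sort(key=lambda node: value(goal, node[0]), reverse=True)
        frontier = children[:branch]                      # keep the most promising branches
    best_video, best_actions = frontier[0]
    return best_actions, best_video                       # multimodal plan: text + video
```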
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [95.37585041654535]
Embodied AI is capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments.
In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI.
Experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering.
arXiv Detail & Related papers (2023-05-24T11:04:30Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods that make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning (a sketch follows this entry).
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
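As one illustration of exploiting a frozen LM's language prior without fine-tuning, the sketch below scores candidate subgoal sequences by GPT-2 log-probability under a simple prompt and keeps the most likely one; the prompt format and the likelihood-ranking recipe are assumptions for illustration, not the paper's exact method.

```python
# Hedged sketch: rank candidate subgoal sequences by the log-probability a frozen
# pretrained LM assigns to them. The prompt format is an illustrative assumption.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sequence_logprob(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)                 # .loss is the mean token NLL
    return -out.loss.item() * ids.shape[1]        # approximate total log-probability

def best_subgoal_sequence(goal, candidates):
    # candidates: list of subgoal lists, e.g. [["find mug", "pour coffee"], ...]
    scored = [(sequence_logprob(f"Goal: {goal}\nSteps: " + "; ".join(c)), c)
              for c in candidates]
    return max(scored, key=lambda sc: sc[0])[1]
```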
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)