CI w/o TN: Context Injection without Task Name for Procedure Planning
- URL: http://arxiv.org/abs/2402.15579v1
- Date: Fri, 23 Feb 2024 19:34:47 GMT
- Title: CI w/o TN: Context Injection without Task Name for Procedure Planning
- Authors: Xinjie Li
- Abstract summary: Procedure planning in instructional videos involves creating goal-directed plans based on visual start and goal observations from videos.
Previous research has tackled this problem with gradually weaker training supervision, from heavy intermediate visual observations or language instructions to task class supervision.
We propose a much weaker setting without the task name as supervision, which existing large language models cannot currently solve.
- Score: 4.004155037293416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores the challenge of procedure planning in instructional
videos, which involves creating goal-directed plans based on visual start and
goal observations from videos. Previous research has tackled this problem with
gradually weaker training supervision, from heavy intermediate visual
observations or language instructions to task class supervision. However, with
the advent of large language models, even given only the task name, these
models can produce a detailed plan. In this study, we propose a much weaker
setting without the task name as supervision, which existing large language
models cannot currently solve since they require well-formed prompts with
sufficient information. Specifically, we hypothesize that previous intermediate
supervision signals can serve as context information, and we use captions of visual
start and goal observations as a much cheaper form of supervision. This
approach greatly reduces the labeling cost since the captions can be easily
obtained by large pre-trained vision-language foundation models. Technically,
we apply BLIP to generate captions as supervision to train the context feature
with a contrastive learning loss. Afterward, the context feature is fed into the
generator to aid in plan generation. Our experiments on two datasets with
varying scales demonstrate that our model achieves performance comparable to
more strongly supervised baselines on multiple metrics, which validates our hypothesis.
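The pipeline described in the abstract (captions of the start and goal observations as cheap supervision, a contrastive loss that trains a context feature, and a plan generator conditioned on that feature) can be illustrated roughly as follows. This is a minimal PyTorch sketch, not the authors' implementation: the module names (ContextEncoder, PlanGenerator), dimensions, and the toy generator head are hypothetical, and caption embeddings are assumed to be precomputed from BLIP-generated captions with an off-the-shelf text encoder.

```python
# Minimal PyTorch sketch of caption-supervised context learning for procedure planning.
# Module names, dimensions, and the toy generator head are illustrative placeholders,
# not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    """Maps start/goal visual features to a single context feature."""
    def __init__(self, vis_dim=512, ctx_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * vis_dim, 512), nn.ReLU(), nn.Linear(512, ctx_dim)
        )

    def forward(self, start_feat, goal_feat):
        return self.mlp(torch.cat([start_feat, goal_feat], dim=-1))

class PlanGenerator(nn.Module):
    """Toy stand-in for the plan generator: predicts an action for each step."""
    def __init__(self, ctx_dim=256, n_actions=100, horizon=4):
        super().__init__()
        self.horizon, self.n_actions = horizon, n_actions
        self.head = nn.Linear(ctx_dim, horizon * n_actions)

    def forward(self, ctx):
        return self.head(ctx).view(-1, self.horizon, self.n_actions)

def info_nce(ctx, cap_emb, temperature=0.07):
    """Contrastive loss aligning each context feature with its own caption embedding."""
    ctx, cap_emb = F.normalize(ctx, dim=-1), F.normalize(cap_emb, dim=-1)
    logits = ctx @ cap_emb.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(ctx.size(0), device=ctx.device)
    return F.cross_entropy(logits, targets)

# One hypothetical training step on random tensors standing in for real data.
# cap_emb would be the embedding of a BLIP caption of the start/goal frames.
B, vis_dim, ctx_dim, n_actions, horizon = 8, 512, 256, 100, 4
start_feat, goal_feat = torch.randn(B, vis_dim), torch.randn(B, vis_dim)
cap_emb = torch.randn(B, ctx_dim)
gt_plan = torch.randint(0, n_actions, (B, horizon))

ctx_enc = ContextEncoder(vis_dim, ctx_dim)
generator = PlanGenerator(ctx_dim, n_actions, horizon)
opt = torch.optim.Adam(list(ctx_enc.parameters()) + list(generator.parameters()), lr=1e-4)

ctx = ctx_enc(start_feat, goal_feat)                      # context feature
loss = info_nce(ctx, cap_emb) \
     + F.cross_entropy(generator(ctx).flatten(0, 1), gt_plan.flatten())
opt.zero_grad()
loss.backward()
opt.step()
```

In the actual method the captions come from BLIP and the generator is the paper's plan-generation model; the sketch only shows how a contrastive caption loss can shape the context feature that conditions the generator.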
Related papers
- PARADISE: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset [0.0]
We present PARADISE, an abductive reasoning task using a Q&A format on practical procedural text sourced from wikiHow.
It involves warning and tip inference tasks directly associated with goals, excluding intermediary steps, with the aim of testing the ability of the models to infer implicit knowledge of the plan solely from the given goal.
Our experiments, utilizing fine-tuned language models and zero-shot prompting, reveal the effectiveness of task-specific small models over large language models in most scenarios.
arXiv Detail & Related papers (2024-03-05T18:01:59Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Pretrained Language Models as Visual Planners for Human Assistance [12.8775186900555]
Visual Planning for Assistance (VPA) is a task aimed at guiding users to achieve complex multi-step goals.
We decompose VPA into video action segmentation and forecasting.
This novel approach, which we call Visual Language Model based Planner (VLaMP), outperforms baselines across a suite of metrics.
arXiv Detail & Related papers (2023-04-17T18:07:36Z)
- PDPP: Projected Diffusion for Procedure Planning in Instructional Videos [30.637651835289635]
We study the problem of procedure planning in instructional videos.
This problem aims to make goal-directed plans given the current visual observations in unstructured real-life videos.
arXiv Detail & Related papers (2023-03-26T10:50:16Z)
- A Picture is Worth a Thousand Words: Language Models Plan from Pixels [53.85753597586226]
Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments.
In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments.
arXiv Detail & Related papers (2023-03-16T02:02:18Z)
- Unsupervised Task Graph Generation from Instructional Video Transcripts [53.54435048879365]
We consider a setting where text transcripts of instructional videos performing a real-world activity are provided.
The goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps.
We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models along with clustering and ranking components.
arXiv Detail & Related papers (2023-02-17T22:50:08Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning (a rough prompting sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
- P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision [31.73732506824829]
We study the problem of procedure planning in instructional videos.
Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state.
We propose a weakly supervised approach by learning from natural language instructions.
arXiv Detail & Related papers (2022-05-04T19:37:32Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained models succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
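The Few-shot Subgoal Planning entry above argues that the priors in a pre-trained language model are enough to infer subgoal sequences without fine-tuning. Below is a minimal zero-shot prompting sketch of that idea; the model name (gpt2), prompt wording, and parsing are hypothetical placeholders for illustration, not the setup used in that paper.

```python
# Zero-shot subgoal inference with an off-the-shelf language model.
# Model choice, prompt format, and parsing are hypothetical placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

goal = "make a grilled cheese sandwich"
prompt = f"Goal: {goal}\nSubgoals:\n1."

# Greedy decoding; the generated text echoes the prompt, so cut it off and
# keep only lines that look like numbered subgoals.
full_text = generator(prompt, max_new_tokens=60, do_sample=False)[0]["generated_text"]
continuation = "1." + full_text[len(prompt):]
subgoals = [
    line.strip()
    for line in continuation.splitlines()
    if line.strip() and line.strip()[0].isdigit()
]
print(subgoals)
```

Stronger instruction-tuned models and a few in-context exemplars would yield far more coherent subgoals; the point of the sketch is only that no parameter updates are required.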
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.