CI w/o TN: Context Injection without Task Name for Procedure Planning
- URL: http://arxiv.org/abs/2402.15579v1
- Date: Fri, 23 Feb 2024 19:34:47 GMT
- Title: CI w/o TN: Context Injection without Task Name for Procedure Planning
- Authors: Xinjie Li
- Abstract summary: Procedure planning in instructional videos involves creating goal-directed plans based on visual start and goal observations from videos.
Previous research has tackled this problem with gradually weaker training supervision, from heavy intermediate visual observations or language instructions to task class supervision.
We propose a much weaker setting without task name as supervision, which is not currently solvable by existing large language models.
- Score: 4.004155037293416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores the challenge of procedure planning in instructional
videos, which involves creating goal-directed plans based on visual start and
goal observations from videos. Previous research has tackled this problem with
gradually weaker training supervision, from heavy intermediate visual
observations or language instructions to task class supervision. However, with
the advent of large language models, even given only the task name, these
models can produce a detailed plan. In this study, we propose a much weaker
setting without task name as supervision, which is not currently solvable by
existing large language models since they require good prompts with sufficient
information. Specifically, we hypothesize that previous intermediate
supervisions can serve as context information, and we use captions of visual
start and goal observations as a much cheaper form of supervision. This
approach greatly reduces the labeling cost since the captions can be easily
obtained by large pre-trained vision-language foundation models. Technically,
we apply BLIP to generate captions as supervision to train the context feature
with contrastive learning loss. Afterward, the context feature is fed into the
generator to aid in plan generation. Our experiments on two datasets with
varying scales demonstrate that our model can achieve comparable performance on
multiple metrics, which validates our hypothesis.
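The contrastive supervision described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: the symmetric InfoNCE-style loss, temperature value, and the use of random vectors in place of real context features and BLIP caption embeddings are all assumptions.

```python
import numpy as np

def info_nce(context_feats, caption_feats, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss.

    Rows of `context_feats` and `caption_feats` are treated as matched
    (positive) pairs; all other rows in the batch act as negatives.
    """
    # L2-normalize both sets of features so similarity is cosine similarity.
    c = context_feats / np.linalg.norm(context_feats, axis=1, keepdims=True)
    t = caption_feats / np.linalg.norm(caption_feats, axis=1, keepdims=True)
    logits = (c @ t.T) / temperature  # (batch, batch) similarity matrix

    def xent(lg):
        # Cross-entropy with the diagonal (matched pair) as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # Average over both directions: context->caption and caption->context.
    return 0.5 * (xent(logits) + xent(logits.T))
```

In the paper's setting, the caption side would come from BLIP-generated captions of the start and goal frames, and the loss would train the context feature that is later fed into the plan generator; here both are stand-in arrays.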
Related papers
- Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization [77.36122979882649]
Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP)
In this paper, we explore the idea that CV adopts discrete and terminological task definitions, which may be a key barrier to zero-shot task generalization.
Our hypothesis is that without truly understanding previously-seen tasks--due to these terminological definitions--deep models struggle to generalize to novel tasks.
arXiv Detail & Related papers (2024-12-24T16:08:25Z)
- Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision? [62.12375949429938]
Building transferable Graph Neural Networks (GNNs) with CLIP pipeline is challenging because of three fundamental issues.
We leverage multi-modal prompt learning to effectively adapt pre-trained GNN to downstream tasks and data.
Our new paradigm embeds the graphs directly in the same space as the Large Language Models (LLMs) by learning both graph prompts and text prompts simultaneously.
arXiv Detail & Related papers (2024-12-11T08:03:35Z)
- Learning to Plan for Language Modeling from Unlabeled Data [23.042650737356496]
We train a module for planning the future writing process via a self-supervised learning objective.
Given the textual context, this planning module learns to predict future abstract writing actions, which correspond to centroids in a clustered text embedding space.
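The clustered-embedding idea in this entry (abstract writing actions as centroids of a clustered text embedding space) can be sketched as follows. A minimal illustration using toy 2-D vectors in place of real text embeddings; `kmeans` and `writing_action` are hypothetical names, and the deterministic first-k initialization is a simplification for the sketch.

```python
import numpy as np

def kmeans(X, k, iters=10):
    """Tiny Lloyd's k-means over embedding vectors X of shape (n, d)."""
    # Deterministic init from the first k points (fine for a sketch).
    centroids = X[:k].astype(float).copy()
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each centroid to the mean of its assigned embeddings.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def writing_action(embedding, centroids):
    """Map an embedding to its nearest centroid index, treated as an
    abstract 'writing action' label."""
    return int(np.argmin(np.linalg.norm(centroids - embedding, axis=1)))
```

A planning module in this spirit would then be trained to predict the `writing_action` label of upcoming text from the current context, rather than the raw tokens.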
arXiv Detail & Related papers (2024-03-31T09:04:01Z)
- PARADISE: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset [0.0]
We present PARADISE, an abductive reasoning task using Q&A format on practical procedural text sourced from wikiHow.
It involves warning and tip inference tasks directly associated with goals, excluding intermediary steps, with the aim of testing the ability of the models to infer implicit knowledge of the plan solely from the given goal.
Our experiments, utilizing fine-tuned language models and zero-shot prompting, reveal the effectiveness of task-specific small models over large language models in most scenarios.
arXiv Detail & Related papers (2024-03-05T18:01:59Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- Pretrained Language Models as Visual Planners for Human Assistance [12.8775186900555]
Visual Planning for Assistance (VPA) guides users toward achieving complex multi-step goals.
We decompose VPA into video action segmentation and forecasting.
This novel approach, which we call Visual Language Model based Planner (VLaMP), outperforms baselines across a suite of metrics.
arXiv Detail & Related papers (2023-04-17T18:07:36Z)
- A Picture is Worth a Thousand Words: Language Models Plan from Pixels [53.85753597586226]
Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments.
In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments.
arXiv Detail & Related papers (2023-03-16T02:02:18Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
- P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision [31.73732506824829]
We study the problem of procedure planning in instructional videos.
Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state.
We propose a weakly supervised approach by learning from natural language instructions.
arXiv Detail & Related papers (2022-05-04T19:37:32Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.