Open-Event Procedure Planning in Instructional Videos
- URL: http://arxiv.org/abs/2407.05119v1
- Date: Sat, 6 Jul 2024 16:11:46 GMT
- Title: Open-Event Procedure Planning in Instructional Videos
- Authors: Yilu Wu, Hanlin Wang, Jing Wang, Limin Wang
- Abstract summary: We introduce a new task named Open-event Procedure Planning (OEPP), which extends the traditional procedure planning to the open-event setting.
OEPP aims to verify whether a planner can transfer the learned knowledge to similar events that have not been seen during training.
- Score: 18.67781706733587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the current visual observations, the traditional procedure planning task in instructional videos requires a model to generate goal-directed plans within a given action space. All previous methods for this task conduct training and inference under the same action space and can therefore only plan for events pre-defined in the training set. We argue that this setting is impractical for real-life human assistance and propose a more general and practical planning paradigm. Specifically, in this paper, we introduce a new task named Open-event Procedure Planning (OEPP), which extends traditional procedure planning to the open-event setting. OEPP verifies whether a planner can transfer learned knowledge to similar events that have not been seen during training. We build a new benchmark, OpenEvent, for this task from existing datasets and divide the events involved into base and novel parts. During data collection, we carefully ensure that procedural knowledge is transferable between base and novel events by evaluating the similarity between step descriptions of different events in a multi-stage process. Based on the collected data, we further propose a simple and general framework specifically designed for OEPP and conduct an extensive study with various baseline methods, providing a detailed and insightful analysis of the results for this task.
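The base/novel event split at the core of OEPP can be illustrated with a minimal sketch. This is a hypothetical interface, assuming only what the abstract states (a planner receives start and goal observations and is evaluated on events held out of training); all names and structures are illustrative, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PlanningQuery:
    start_obs: str  # stand-in for the start visual observation
    goal_obs: str   # stand-in for the goal visual observation
    horizon: int    # number of intermediate actions to plan

@dataclass
class OpenEventSplit:
    base_events: List[str]   # events seen during training
    novel_events: List[str]  # held-out events for transfer evaluation

def is_open_event(event: str, split: OpenEventSplit) -> bool:
    """An OEPP planner is evaluated on events absent from training."""
    return event in split.novel_events

# Illustrative split: the planner trains on base events and is tested
# on similar but unseen novel events.
split = OpenEventSplit(
    base_events=["make coffee", "change tire"],
    novel_events=["make tea"],
)
query = PlanningQuery(start_obs="frame_t0", goal_obs="frame_tN", horizon=3)

print(is_open_event("make tea", split))     # novel event -> True
print(is_open_event("make coffee", split))  # base event -> False
```

Under this framing, traditional procedure planning corresponds to evaluating only on `base_events`, while OEPP additionally measures transfer to `novel_events`.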
Related papers
- Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction Following [17.608330952846075]
Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments.
One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning with labeled data.
We introduce the Socratic Planner, the first zero-shot planning method that infers without the need for any training data.
arXiv Detail & Related papers (2024-04-21T08:10:20Z) - RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos [46.26690150997731]
We propose a new and practical setting, called adaptive procedure planning in instructional videos.
RAP adaptively determines the conclusion of actions using an auto-regressive model architecture.
arXiv Detail & Related papers (2024-03-27T14:22:40Z) - Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos [16.333295670635557]
We explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan.
This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos.
We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data.
arXiv Detail & Related papers (2024-03-05T08:55:51Z) - Pretext Training Algorithms for Event Sequence Data [29.70078362944441]
This paper proposes a self-supervised pretext training framework tailored to event sequence data.
Our pretext tasks unlock foundational representations that are generalizable across different down-stream tasks.
arXiv Detail & Related papers (2024-02-16T01:25:21Z) - Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty [56.30846158280031]
Task planning for embodied AI has been one of the most challenging problems.
We propose a task-agnostic method named 'planning as in-painting'
The proposed framework achieves promising performances in various embodied AI tasks.
arXiv Detail & Related papers (2023-12-02T10:07:17Z) - Event-Guided Procedure Planning from Instructional Videos with Text Supervision [31.82121743586165]
We focus on the task of procedure planning from instructional videos with text supervision.
A critical challenge of this task is the large semantic gap between observed visual states and unobserved intermediate actions.
We propose a novel event-guided paradigm, which first infers events from the observed states and then plans out actions based on both the states and predicted events.
arXiv Detail & Related papers (2023-08-17T09:43:28Z) - Zero-Shot On-the-Fly Event Schema Induction [61.91468909200566]
We present a new approach in which large language models are utilized to generate source documents that allow predicting, given a high-level event definition, the specific events, arguments, and relations between them.
Using our model, complete schemas on any topic can be generated on-the-fly without any manual data collection, i.e., in a zero-shot manner.
arXiv Detail & Related papers (2022-10-12T14:37:00Z) - Process-BERT: A Framework for Representation Learning on Educational Process Data [68.8204255655161]
We propose a framework for learning representations of educational process data.
Our framework consists of a pre-training step that uses BERT-type objectives to learn representations from sequential process data.
We apply our framework to the 2019 nation's report card data mining competition dataset.
arXiv Detail & Related papers (2022-04-28T16:07:28Z) - Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning [114.1830997893756]
This work focuses on learning a model to plan goal-directed actions in real-life videos.
We propose novel algorithms to model human behaviors through Bayesian Inference and model-based Imitation Learning.
arXiv Detail & Related papers (2021-10-05T01:06:53Z) - PEARL: Parallelized Expert-Assisted Reinforcement Learning for Scene Rearrangement Planning [28.9887381071402]
We propose a fine-grained action definition for Scene Rearrangement Planning (SRP) and introduce a large-scale scene rearrangement dataset.
We also propose a novel learning paradigm to efficiently train an agent through self-playing, without any prior knowledge.
arXiv Detail & Related papers (2021-05-10T03:27:16Z) - Exploiting the Matching Information in the Support Set for Few Shot Event Classification [66.31312496170139]
We investigate event classification under the few-shot learning setting.
We propose a novel training method for this problem that extensively exploits the support set during the training process.
Our experiments on two benchmark EC datasets show that the proposed method can improve the best reported few-shot learning models by up to 10% in accuracy for event classification.
arXiv Detail & Related papers (2020-02-13T00:40:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.