Event-Guided Procedure Planning from Instructional Videos with Text Supervision
- URL: http://arxiv.org/abs/2308.08885v1
- Date: Thu, 17 Aug 2023 09:43:28 GMT
- Title: Event-Guided Procedure Planning from Instructional Videos with Text Supervision
- Authors: An-Lan Wang, Kun-Yu Lin, Jia-Run Du, Jingke Meng, Wei-Shi Zheng
- Abstract summary: We focus on the task of procedure planning from instructional videos with text supervision.
A critical challenge of this task is the large semantic gap between observed visual states and unobserved intermediate actions.
We propose a novel event-guided paradigm, which first infers events from the observed states and then plans out actions based on both the states and predicted events.
- Score: 31.82121743586165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we focus on the task of procedure planning from instructional
videos with text supervision, where a model aims to predict an action sequence
to transform the initial visual state into the goal visual state. A critical
challenge of this task is the large semantic gap between observed visual states
and unobserved intermediate actions, which is ignored by previous works.
Specifically, this semantic gap refers to the fact that the content of the observed
visual states is semantically different from the elements of some action text
labels in a procedure. To bridge this semantic gap, we propose a novel
event-guided paradigm, which first infers events from the observed states and
then plans out actions based on both the states and the predicted events. Our
inspiration comes from the observation that planning a procedure from an
instructional video amounts to completing a specific event, and a specific event
usually involves specific actions. Based on the proposed paradigm, we contribute an Event-guided
Prompting-based Procedure Planning (E3P) model, which encodes event information
into the sequential modeling process to support procedure planning. To further
consider the strong action associations within each event, our E3P adopts a
mask-and-predict approach for relation mining, incorporating a probabilistic
masking scheme for regularization. Extensive experiments on three datasets
demonstrate the effectiveness of our proposed model.
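The two-stage, event-guided idea described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' E3P implementation: the dot-product prototype scorer, the function names, and the Bernoulli masking rate are all assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def infer_event(start_state, goal_state, event_prototypes):
    """Stage 1 of the event-guided paradigm: score each candidate event
    against the observed start/goal visual states and pick the best match.
    (Prototype matrix and dot-product scoring are illustrative assumptions;
    the paper uses a prompting-based model.)"""
    query = np.concatenate([start_state, goal_state])
    scores = event_prototypes @ query  # one score per candidate event
    return int(np.argmax(scores))

def probabilistic_mask(action_ids, mask_prob=0.3, mask_token=-1):
    """Mask-and-predict regularization: each action in the ground-truth
    sequence is independently replaced by a mask token with probability
    mask_prob; a predictor (omitted here) is then trained to recover the
    masked actions from the surviving ones, mining intra-event relations."""
    keep = rng.random(len(action_ids)) >= mask_prob
    return np.where(keep, action_ids, mask_token)
```

In this sketch, the index returned by `infer_event` would condition the action planner alongside the visual states, and `probabilistic_mask` would corrupt the action sequence during training so the model must exploit action associations within the inferred event.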
Related papers
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Open-Event Procedure Planning in Instructional Videos [18.67781706733587]
We introduce a new task named Open-event Procedure Planning (OEPP), which extends the traditional procedure planning to the open-event setting.
OEPP aims to verify whether a planner can transfer the learned knowledge to similar events that have not been seen during training.
arXiv Detail & Related papers (2024-07-06T16:11:46Z)
- ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos [10.180115984765582]
ActionDiffusion is a novel diffusion model for procedure planning in instructional videos.
Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process.
arXiv Detail & Related papers (2024-03-13T14:54:04Z)
- SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos [54.01116513202433]
We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations.
Recent works succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, but overlooked the roles of states in the procedures.
We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures.
arXiv Detail & Related papers (2024-03-03T19:53:06Z)
- Pretext Training Algorithms for Event Sequence Data [29.70078362944441]
This paper proposes a self-supervised pretext training framework tailored to event sequence data.
Our pretext tasks unlock foundational representations that are generalizable across different downstream tasks.
arXiv Detail & Related papers (2024-02-16T01:25:21Z)
- Tapestry of Time and Actions: Modeling Human Activity Sequences using Temporal Point Process Flows [9.571588145356277]
We present ProActive, a framework for modeling the continuous-time distribution of actions in an activity sequence.
ProActive addresses three high-impact problems -- next action prediction, sequence-goal prediction, and end-to-end sequence generation.
arXiv Detail & Related papers (2023-07-13T19:17:54Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework via denoising diffusion models, which shares the same inherent spirit of such iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
- Unifying Event Detection and Captioning as Sequence Generation via Pre-Training [53.613265415703815]
We propose a unified pre-training and fine-tuning framework to enhance the inter-task association between event detection and captioning.
Our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data.
arXiv Detail & Related papers (2022-07-18T14:18:13Z)
- Detecting Ongoing Events Using Contextual Word and Sentence Embeddings [110.83289076967895]
This paper introduces the Ongoing Event Detection (OED) task.
The goal is to detect ongoing event mentions only, as opposed to historical, future, hypothetical, or other forms of events that are neither fresh nor current.
Any application that needs to extract structured information about ongoing events from unstructured texts can take advantage of an OED system.
arXiv Detail & Related papers (2020-07-02T20:44:05Z)
- Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.