SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional
Videos
- URL: http://arxiv.org/abs/2403.01599v1
- Date: Sun, 3 Mar 2024 19:53:06 GMT
- Title: SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional
Videos
- Authors: Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, Shih-Fu Chang
- Abstract summary: We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations.
Recent works succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, which overlooked the roles of states in the procedures.
We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures.
- Score: 54.01116513202433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of procedure planning in instructional videos, which
aims to make a goal-oriented sequence of action steps given partial visual
state observations. The motivation of this problem is to learn a structured and
plannable state and action space. Recent works succeeded in sequence modeling
of steps with only sequence-level annotations accessible during training, which
overlooked the roles of states in the procedures. In this work, we point out
that State CHangEs MAtter (SCHEMA) for procedure planning in instructional
videos. We aim to establish a more structured state space by investigating the
causal relations between steps and states in procedures. Specifically, we
explicitly represent each step as state changes and track the state changes in
procedures. For step representation, we leverage the commonsense knowledge in
large language models (LLMs) to describe the state changes of steps via our
designed chain-of-thought prompting. For state change tracking, we align visual
state observations with language state descriptions via cross-modal contrastive
learning, and explicitly model the intermediate states of the procedure using
LLM-generated state descriptions. Experiments on CrossTask, COIN, and NIV
benchmark datasets demonstrate that our proposed SCHEMA model achieves
state-of-the-art performance and obtains explainable visualizations.
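The abstract's idea of describing each step as a state change via chain-of-thought prompting can be sketched as below. This is a hypothetical illustration, not the authors' actual prompt: the prompt wording, the `build_state_change_prompt` and `describe_state_change` helpers, and the stubbed LLM response are all assumptions.

```python
# Hypothetical sketch: prompt an LLM to describe an action step as a
# before/after state change, in the chain-of-thought style the abstract
# mentions. The prompt text and helpers are illustrative, not the paper's.

def build_state_change_prompt(step: str) -> str:
    """Compose a chain-of-thought style prompt for one action step."""
    return (
        f"Action step: {step}\n"
        "Let's think step by step.\n"
        "1. What objects does this step involve?\n"
        "2. What is their state BEFORE the step?\n"
        "3. What is their state AFTER the step?\n"
        "Answer with 'before: ...' and 'after: ...' lines."
    )

def describe_state_change(step: str, llm=None) -> dict:
    """Query an LLM (stubbed here) and parse before/after descriptions."""
    prompt = build_state_change_prompt(step)
    if llm is None:
        # Stub response so the sketch runs without any LLM API.
        response = "before: whole tomato on board\nafter: tomato cut into slices"
    else:
        response = llm(prompt)
    states = {}
    for line in response.splitlines():
        key, _, desc = line.partition(":")
        states[key.strip()] = desc.strip()
    return states

change = describe_state_change("cut tomato")
print(change["before"], "->", change["after"])
```

The parsed before/after pair is what would then serve as the language-side state description for a step.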
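The cross-modal contrastive learning step — pulling a visual state embedding toward the language description of the same state and away from descriptions of other states — can be sketched with a symmetric InfoNCE-style loss. This is a minimal pure-Python sketch under assumed toy embeddings and temperature, not the paper's implementation.

```python
# Minimal sketch of a symmetric InfoNCE contrastive objective over matched
# (visual, text) embedding pairs: visual[i] and text[i] describe the same
# state; all other pairings act as negatives. Toy values are assumptions.

import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u)) or 1.0  # avoid division by zero

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def info_nce(visual, text, temperature=0.1):
    """Average cross-entropy in both directions (visual->text, text->visual)."""
    n = len(visual)
    # Temperature-scaled cosine similarity matrix.
    sim = [[cosine(v, t) / temperature for t in text] for v in visual]
    loss = 0.0
    for i in range(n):
        row = sim[i]                          # visual i vs. every text
        col = [sim[j][i] for j in range(n)]   # every visual vs. text i
        for logits in (row, col):
            log_z = math.log(sum(math.exp(x) for x in logits))
            loss += -(logits[i] - log_z)      # cross-entropy, target index i
    return loss / (2 * n)

# Toy embeddings: matched pairs are similar, mismatched pairs are not.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
print(info_nce(visual, text))
```

Minimizing this loss aligns each visual observation with the LLM-generated description of the same state, which is the alignment the abstract describes.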
Related papers
- STAT: Towards Generalizable Temporal Action Localization [56.634561073746056]
Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action instances with only video-level labels.
Existing methods suffer from severe performance degradation when transferring to different distributions.
We propose GTAL, which focuses on improving the generalizability of action localization methods.
arXiv Detail & Related papers (2024-04-20T07:56:21Z)
- RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos [46.26690150997731]
We propose a new and practical setting, called adaptive procedure planning in instructional videos.
RAP adaptively determines the conclusion of actions using an auto-regressive model architecture.
arXiv Detail & Related papers (2024-03-27T14:22:40Z)
- ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos [10.180115984765582]
ActionDiffusion is a novel diffusion model for procedure planning in instructional videos.
Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process.
arXiv Detail & Related papers (2024-03-13T14:54:04Z)
- Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning [85.84504287685884]
Skip-Plan is a condensed action space learning method for procedure planning in instructional videos.
By skipping uncertain nodes and edges in action chains, we transfer long and complex sequence functions into short but reliable ones.
Our model explores all sorts of reliable sub-relations within an action sequence in the condensed action space.
arXiv Detail & Related papers (2023-10-01T08:02:33Z)
- Event-Guided Procedure Planning from Instructional Videos with Text Supervision [31.82121743586165]
We focus on the task of procedure planning from instructional videos with text supervision.
A critical challenge of this task is the large semantic gap between observed visual states and unobserved intermediate actions.
We propose a novel event-guided paradigm, which first infers events from the observed states and then plans out actions based on both the states and predicted events.
arXiv Detail & Related papers (2023-08-17T09:43:28Z)
- Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
- Language Modeling with Latent Situations [46.38670628102201]
SituationSupervision is a family of approaches for improving coherence in language models.
It trains models to construct and condition on explicit representations of entities and their states.
It produces coherence improvements of 4-11%.
arXiv Detail & Related papers (2022-12-20T05:59:42Z)
- P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision [31.73732506824829]
We study the problem of procedure planning in instructional videos.
Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state.
We propose a weakly supervised approach by learning from natural language instructions.
arXiv Detail & Related papers (2022-05-04T19:37:32Z)
- Knowledge-Aware Procedural Text Understanding with Multi-Stage Training [110.93934567725826]
We focus on the task of procedural text understanding, which aims to comprehend such documents and track entities' states and locations during a process.
Two challenges, the difficulty of commonsense reasoning and data insufficiency, still remain unsolved.
We propose a novel KnOwledge-Aware proceduraL text understAnding (KOALA) model, which effectively leverages multiple forms of external knowledge.
arXiv Detail & Related papers (2020-09-28T10:28:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.