What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning
- URL: http://arxiv.org/abs/2503.21055v1
- Date: Thu, 27 Mar 2025 00:03:55 GMT
- Title: What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning
- Authors: Chi-Hsi Kung, Frangil Ramirez, Juhyung Ha, Yi-Ting Chen, David Crandall, Yi-Hsuan Tsai
- Abstract summary: We study procedure-aware video representation learning by incorporating state-change descriptions. We generate state-change counterfactuals that simulate hypothesized failure outcomes. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals.
- Score: 22.00652926645987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding a procedural activity requires modeling both how action steps transform the scene and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Existing work has studied procedure-aware video representations by proposing novel approaches such as modeling the temporal order of actions, but has not explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining unseen "What if" scenarios. This counterfactual reasoning facilitates the model's ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation and error detection. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals, achieving significant improvements on multiple tasks. We will make our source code and data publicly available soon.
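The abstract describes two supervision signals: LLM-generated descriptions of each step's actual state change, and counterfactual descriptions of hypothesized failures. As a rough illustration of how such signals could supervise a video encoder, below is a minimal contrastive-training sketch; the encoder interfaces, the InfoNCE-style loss, and all tensor names are assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the authors' code): supervising a video encoder with
# LLM-generated state-change descriptions and their counterfactuals.
# Assumes precomputed text embeddings for (i) the actual state change of each clip
# and (ii) a hypothesized failure ("what if") counterfactual of the same step.

import torch
import torch.nn.functional as F

def state_change_contrastive_loss(video_emb, change_emb, counterfactual_emb, tau=0.07):
    """video_emb:          (B, D) clip embeddings from the video encoder
       change_emb:         (B, D) text embeddings of the true state-change descriptions
       counterfactual_emb: (B, D) text embeddings of hypothesized failure outcomes
    """
    v = F.normalize(video_emb, dim=-1)
    t_pos = F.normalize(change_emb, dim=-1)
    t_neg = F.normalize(counterfactual_emb, dim=-1)
    B = v.size(0)

    # Positive similarity: each clip vs. its own state-change description.
    pos = (v * t_pos).sum(-1, keepdim=True) / tau        # (B, 1)
    # Negatives: all counterfactuals plus other clips' descriptions.
    neg_cf = v @ t_neg.t() / tau                          # (B, B)
    neg_desc = v @ t_pos.t() / tau                        # (B, B)

    logits = torch.cat([pos, neg_cf, neg_desc], dim=1)    # (B, 1 + 2B)
    # Mask out the diagonal of neg_desc, which duplicates the positive term.
    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask[:, 1 + B:] = torch.eye(B, dtype=torch.bool, device=v.device)
    logits = logits.masked_fill(mask, float('-inf'))

    targets = torch.zeros(B, dtype=torch.long, device=v.device)  # positives at column 0
    return F.cross_entropy(logits, targets)
```

Treating the counterfactuals as extra hard negatives is one plausible way to push the encoder to distinguish a step's actual outcome from its imagined failure modes; the paper may combine the signals differently.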
Related papers
- Learning Actionable World Models for Industrial Process Control [5.870452455598225]
An effective AI system must learn about the behavior of the complex system from very limited training data. We propose a novel methodology that disentangles process parameters in the learned latent representation, allowing for fine-grained control.
arXiv Detail & Related papers (2025-03-03T11:05:44Z)
- SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos [54.01116513202433]
We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations.
Recent works have succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, but have overlooked the roles of states in the procedures.
We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures.
arXiv Detail & Related papers (2024-03-03T19:53:06Z)
- Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos [16.93979476655776]
A key challenge with procedure planning in instructional videos is how to handle a large decision space consisting of a multitude of action types.
We introduce a simple yet effective enhancement - a masked diffusion model.
We learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions.
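As a rough illustration of the joint visual-text embedding described above, the sketch below prompts a pre-trained CLIP text encoder with action phrases and projects visual step features into the same space; the prompt template, the 768-dimensional visual features, and the projection head are assumptions rather than details taken from the paper.

```python
# Hedged sketch: text embeddings from a prompted, pre-trained vision-language model,
# paired with a learned projection of visual step features into the same space.

import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed_action_prompts(actions):
    """Turn action labels into text embeddings that emphasize the human action."""
    prompts = [f"a person {a}" for a in actions]       # hypothetical prompt template
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        return clip.get_text_features(**tokens)         # (N, 512)

visual_proj = nn.Linear(768, 512)                        # 768 = assumed visual feature dim

def joint_embedding_similarity(step_features, actions):
    text_emb = embed_action_prompts(actions)
    vis_emb = visual_proj(step_features)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    vis_emb = vis_emb / vis_emb.norm(dim=-1, keepdim=True)
    return vis_emb @ text_emb.t()                        # cosine similarities
```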
arXiv Detail & Related papers (2023-09-14T03:25:37Z)
- Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations [22.723309913388196]
We learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations.
Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering.
arXiv Detail & Related papers (2023-03-31T07:02:26Z)
- STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos [40.82053186029603]
We decompose the problem into two steps: representation learning and key step extraction.
We propose a training objective, the Bootstrapped Multi-Cue Contrastive (BMC2) loss, to learn discriminative representations for various steps without any labels.
We show significant improvements over prior works for the task of key step localization and phase classification.
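The abstract names the BMC2 loss but gives no formula. For orientation only, a generic multi-cue contrastive sketch, in which embeddings of the same timestep under different cues are treated as positives, is shown below; the pairing rule and all names are assumptions, not the BMC2 formulation itself.

```python
# Generic multi-cue contrastive sketch (the actual BMC2 loss is not specified in the
# abstract; the positive-pair rule below, same timestep across cues, is an assumption).

import torch
import torch.nn.functional as F

def multi_cue_contrastive_loss(cue_embeddings, tau=0.1):
    """cue_embeddings: list of (T, D) tensors, one per cue (e.g., RGB, flow, audio),
       all describing the same T timesteps of one unlabeled procedural video."""
    loss, count = 0.0, 0
    for a in range(len(cue_embeddings)):
        for b in range(len(cue_embeddings)):
            if a == b:
                continue
            za = F.normalize(cue_embeddings[a], dim=-1)
            zb = F.normalize(cue_embeddings[b], dim=-1)
            logits = za @ zb.t() / tau                  # (T, T) cross-cue similarities
            targets = torch.arange(za.size(0))          # same timestep = positive
            loss = loss + F.cross_entropy(logits, targets)
            count += 1
    return loss / max(count, 1)
```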
arXiv Detail & Related papers (2023-01-02T18:32:45Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- Chain of Thought Imitation with Procedure Cloning [129.62135987416164]
We propose procedure cloning, which applies supervised sequence prediction to imitate the series of expert computations.
We show that imitating the intermediate computations of an expert's behavior enables procedure cloning to learn policies exhibiting significant generalization to unseen environment configurations.
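A minimal sketch of the procedure-cloning idea as summarized above: rather than mapping an observation directly to the expert's action, a sequence model is trained to predict the expert's intermediate computations followed by the action. The discrete token encoding and the small GRU model are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: supervised sequence prediction over an expert's computation trace.

import torch
import torch.nn as nn

class ProcedureCloner(nn.Module):
    def __init__(self, obs_dim, vocab_size, hidden=256):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, obs, proc_tokens):
        """obs: (B, obs_dim); proc_tokens: (B, L) expert computation steps plus the final
        action, encoded as discrete tokens. Returns next-token logits."""
        h0 = torch.tanh(self.obs_proj(obs)).unsqueeze(0)   # condition on the observation
        out, _ = self.rnn(self.embed(proc_tokens), h0)
        return self.head(out)                              # (B, L, vocab_size)

def procedure_cloning_loss(model, obs, proc_tokens):
    logits = model(obs, proc_tokens[:, :-1])               # teacher forcing
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), proc_tokens[:, 1:].reshape(-1))
```

Plain behavior cloning would supervise only the final action; predicting the whole trace is what exposes the expert's intermediate reasoning to the model.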
arXiv Detail & Related papers (2022-05-22T13:14:09Z)
- P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision [31.73732506824829]
We study the problem of procedure planning in instructional videos.
Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state.
We propose a weakly supervised approach by learning from natural language instructions.
arXiv Detail & Related papers (2022-05-04T19:37:32Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
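A hedged sketch of the matching step described above, using an off-the-shelf sentence encoder as a stand-in for the paper's language model: ASR segments are assigned to the most similar knowledge-base step description and kept only above a similarity threshold. The model choice and threshold are assumptions for illustration.

```python
# Hedged sketch: distant supervision by matching noisy ASR text to step descriptions.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed stand-in encoder

def match_transcripts_to_steps(asr_segments, step_descriptions, min_sim=0.3):
    asr_emb = encoder.encode(asr_segments, convert_to_tensor=True)
    step_emb = encoder.encode(step_descriptions, convert_to_tensor=True)
    sims = util.cos_sim(asr_emb, step_emb)           # (num_segments, num_steps)
    best = sims.argmax(dim=1)
    # Keep only confident matches; low-similarity segments get no pseudo-label.
    return [(i, int(best[i])) for i in range(len(asr_segments))
            if float(sims[i, best[i]]) >= min_sim]
```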
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning [114.1830997893756]
This work focuses on learning a model to plan goal-directed actions in real-life videos.
We propose novel algorithms to model human behaviors through Bayesian Inference and model-based Imitation Learning.
arXiv Detail & Related papers (2021-10-05T01:06:53Z)
- Visual Adversarial Imitation Learning using Variational Models [60.69745540036375]
Reward function specification remains a major impediment for learning behaviors through deep reinforcement learning.
Visual demonstrations of desired behaviors often present an easier and more natural way to teach agents.
We develop a variational model-based adversarial imitation learning algorithm.
arXiv Detail & Related papers (2021-07-16T00:15:18Z)