ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos
- URL: http://arxiv.org/abs/2403.08591v2
- Date: Sat, 20 Jul 2024 17:55:05 GMT
- Title: ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos
- Authors: Lei Shi, Paul Bürkner, Andreas Bulling
- Abstract summary: ActionDiffusion is a novel diffusion model for procedure planning in instructional videos.
Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process.
- Score: 10.180115984765582
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present ActionDiffusion -- a novel diffusion model for procedure planning in instructional videos that is the first to take temporal inter-dependencies between actions into account in a diffusion model for procedure planning. This approach is in stark contrast to existing methods that fail to exploit the rich information content available in the particular order in which actions are performed. Our method unifies the learning of temporal dependencies between actions and the denoising of the action plan in the diffusion process by projecting the action information into the noise space. This is achieved 1) by adding action embeddings to the noise masks in the noise-adding phase and 2) by introducing an attention mechanism in the noise prediction network to learn the correlations between different action steps. We report extensive experiments on three instructional video benchmark datasets (CrossTask, Coin, and NIV) and show that our method outperforms previous state-of-the-art methods on all metrics on CrossTask and NIV, and on all metrics except accuracy on the Coin dataset. We show that by adding action embeddings into the noise masks, the diffusion model can better learn temporal dependencies between actions and improve procedure-planning performance.
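The abstract's first mechanism lends itself to a short sketch. Below is a minimal, illustrative PyTorch rendering of the noise-adding step with action embeddings mixed into the noise mask; the dimensions, beta schedule, and embedding table are assumptions for illustration, not the authors' implementation, and the attention-based noise predictor (the second mechanism) is omitted.

```python
import torch
import torch.nn as nn

# Illustrative forward-diffusion step where the Gaussian noise is augmented
# with per-step action embeddings, so the corrupted plan carries information
# about which actions occur. All sizes and the schedule are assumptions.

T = 1000                                     # diffusion steps
betas = torch.linspace(1e-4, 0.02, T)        # linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

num_actions, horizon, dim = 105, 4, 128
action_emb = nn.Embedding(num_actions, dim)  # hypothetical action-embedding table

def q_sample(x0, t, action_ids):
    """Forward process with an action-aware noise mask."""
    eps = torch.randn_like(x0)
    noise = eps + action_emb(action_ids)     # inject action info into the noise
    abar = alphas_bar[t].view(-1, 1, 1)
    return abar.sqrt() * x0 + (1 - abar).sqrt() * noise

x0 = torch.randn(8, horizon, dim)                  # clean action-plan features
t = torch.randint(0, T, (8,))
ids = torch.randint(0, num_actions, (8, horizon))  # ground-truth action labels
print(q_sample(x0, t, ids).shape)  # torch.Size([8, 4, 128])
```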
Related papers
- ADiff4TPP: Asynchronous Diffusion Models for Temporal Point Processes [30.928368603673285]
This work introduces a novel approach to modeling temporal point processes using diffusion models with an asynchronous noise schedule.
We derive an objective to effectively train these models for a general family of noise schedules based on conditional flow matching.
Our method models the joint distribution of the latent representations of events in a sequence and achieves state-of-the-art results in predicting both the next inter-event time and the event type on benchmark datasets.
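A minimal sketch of the asynchronous-schedule idea, assuming a simple staggered per-event diffusion time and a flow-matching-style linear interpolation; the paper's actual schedule and objective are more involved.

```python
import torch

# Each event in the sequence is diffused to a different time, so later events
# are noisier than earlier ones. The stagger and interpolation are assumptions.

def async_noise(events, t, offset=0.1):
    """events: (B, L, D) latent event embeddings; t in [0, 1] per sequence."""
    B, L, D = events.shape
    # per-position diffusion time, staggered along the sequence and clamped
    t_i = (t.view(B, 1) + offset * torch.arange(L).view(1, L)).clamp(0, 1)
    t_i = t_i.view(B, L, 1)
    noise = torch.randn_like(events)
    # linear interpolation between data and noise (flow-matching style)
    return (1 - t_i) * events + t_i * noise, noise

x_t, eps = async_noise(torch.randn(4, 6, 32), torch.rand(4))
print(x_t.shape)  # torch.Size([4, 6, 32])
```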
arXiv Detail & Related papers (2025-04-29T04:17:39Z)
- Latent Diffusion Planning for Imitation Learning [78.56207566743154]
Latent Diffusion Planning (LDP) is a modular approach consisting of a planner and inverse dynamics model.
By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data.
On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches.
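A rough sketch of the modular split, with a placeholder recurrent planner standing in for the latent diffusion planner and an assumed MLP inverse-dynamics model:

```python
import torch
import torch.nn as nn

# Planner predicts future *latent states*; a separate inverse-dynamics model
# decodes actions from consecutive latent pairs. All modules are placeholders.

latent_dim, action_dim, horizon = 64, 7, 8

planner = nn.GRU(latent_dim, latent_dim, batch_first=True)  # stand-in planner
inverse_dynamics = nn.Sequential(
    nn.Linear(2 * latent_dim, 128), nn.ReLU(), nn.Linear(128, action_dim)
)

def plan_and_act(z0):
    """z0: (B, latent_dim) current latent observation."""
    seed = z0.unsqueeze(1).repeat(1, horizon, 1)
    z_future, _ = planner(seed)                        # (B, horizon, latent_dim)
    z_all = torch.cat([z0.unsqueeze(1), z_future], 1)  # prepend current state
    pairs = torch.cat([z_all[:, :-1], z_all[:, 1:]], -1)
    return inverse_dynamics(pairs)                     # (B, horizon, action_dim)

print(plan_and_act(torch.randn(2, latent_dim)).shape)  # torch.Size([2, 8, 7])
```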
arXiv Detail & Related papers (2025-04-23T17:53:34Z)
- CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning [11.4414301678724]
We propose a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos.
Our method uses a Variational Autoencoder to learn the latent representation of actions and observations as constraints.
We show that our method outperforms state-of-the-art methods by a large margin.
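A minimal sketch of the VAE component, assuming one-hot action inputs and placeholder linear encoder/decoder layers:

```python
import torch
import torch.nn as nn

# Encode an action (one-hot) into a latent that could later constrain a
# diffusion planner. Architecture and dimensions are illustrative assumptions.

class ActionVAE(nn.Module):
    def __init__(self, num_actions=105, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(num_actions, 2 * z_dim)  # outputs mean, log-variance
        self.dec = nn.Linear(z_dim, num_actions)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

vae = ActionVAE()
x = torch.eye(105)[:4]                       # four one-hot actions
recon, mu, logvar = vae(x)
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
loss = nn.functional.cross_entropy(recon, x.argmax(-1)) + kl
print(loss.item())
```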
arXiv Detail & Related papers (2025-03-09T14:31:46Z)
- Dual Conditional Diffusion Models for Sequential Recommendation [63.82152785755723]
We propose Dual Conditional Diffusion Models for Sequential Recommendation (DCRec).
DCRec integrates implicit and explicit information by embedding dual conditions into both the forward and reverse diffusion processes.
This allows the model to retain valuable sequential and contextual information while leveraging explicit user-item interactions to guide the recommendation process.
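A loose sketch of dual conditioning, with an assumed implicit condition mixed into the forward noise and an explicit condition fed to the reverse denoiser; all modules are placeholders:

```python
import torch
import torch.nn as nn

# Implicit condition (sequence context) shapes the forward noising; explicit
# condition (target-item interaction) conditions the reverse denoiser.

dim = 64
denoiser = nn.Sequential(nn.Linear(3 * dim, 128), nn.ReLU(), nn.Linear(128, dim))

def forward_noise(x0, abar_t, implicit_cond):
    eps = torch.randn_like(x0)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * (eps + implicit_cond), eps

def reverse_step(x_t, implicit_cond, explicit_cond):
    # the denoiser sees the noisy item embedding plus both conditions
    return denoiser(torch.cat([x_t, implicit_cond, explicit_cond], -1))

x0 = torch.randn(8, dim)
seq_ctx, item_ctx = torch.randn(8, dim), torch.randn(8, dim)
x_t, eps = forward_noise(x0, torch.tensor(0.5), seq_ctx)
print(reverse_step(x_t, seq_ctx, item_ctx).shape)  # torch.Size([8, 64])
```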
arXiv Detail & Related papers (2024-10-29T11:51:06Z)
- Intention-aware Denoising Diffusion Model for Trajectory Prediction [14.524496560759555]
Trajectory prediction is an essential component in autonomous driving, particularly for collision avoidance systems.
We propose an Intention-aware denoising Diffusion Model (IDM) that uses a diffusion model to generate the distribution of future trajectories.
Our methods achieve state-of-the-art results, with an FDE of 13.83 pixels on the SDD dataset and 0.36 meters on the ETH/UCY dataset.
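A very loose sketch of intention-conditioned sampling, assuming the "intention" is an estimated goal point and using a simplified fixed-step refinement in place of a proper reverse diffusion:

```python
import torch
import torch.nn as nn

# Draw many samples from a trajectory denoiser to approximate a *distribution*
# over futures, conditioning each refinement step on an assumed goal point.

steps, horizon = 50, 12
denoiser = nn.Sequential(nn.Linear(2 + 2, 64), nn.ReLU(), nn.Linear(64, 2))

def sample_trajectories(goal, n_samples=20):
    x = torch.randn(n_samples, horizon, 2)   # noisy (x, y) waypoints
    g = goal.expand(n_samples, horizon, 2)   # broadcast intention to all samples
    for _ in range(steps):                   # simplistic fixed-step refinement
        x = x - 0.1 * denoiser(torch.cat([x, g], -1))
    return x

trajs = sample_trajectories(torch.tensor([5.0, 3.0]))
print(trajs.shape)  # torch.Size([20, 12, 2])
```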
arXiv Detail & Related papers (2024-03-14T09:05:25Z)
- Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty [56.30846158280031]
Task planning for embodied AI has been one of the most challenging problems.
We propose a task-agnostic method named 'planning as in-painting'.
The proposed framework achieves promising performance on various embodied AI tasks.
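A minimal sketch of the in-painting view, assuming known start and goal states and a placeholder denoiser that only updates the masked middle of the plan:

```python
import torch
import torch.nn as nn

# As in image in-painting: keep the observed start/goal fixed and repeatedly
# refine only the masked middle of the plan. Denoiser and sizes are assumed.

denoiser = nn.Linear(16, 16)

def inpaint_plan(start, goal, horizon=6, steps=50):
    plan = torch.randn(horizon, 16)
    plan[0], plan[-1] = start, goal          # known endpoints
    mask = torch.zeros(horizon, 1)
    mask[1:-1] = 1.0                         # only intermediate steps are free
    for _ in range(steps):
        update = denoiser(plan)
        plan = plan + 0.1 * mask * (update - plan)  # endpoints stay untouched
    return plan

print(inpaint_plan(torch.randn(16), torch.randn(16)).shape)  # torch.Size([6, 16])
```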
arXiv Detail & Related papers (2023-12-02T10:07:17Z)
- Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos [16.93979476655776]
A key challenge with procedure planning in instructional videos is how to handle a large decision space consisting of a multitude of action types.
We introduce a simple yet effective enhancement - a masked diffusion model.
We learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions.
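A minimal sketch of a joint visual-text embedding with action-focused prompts, using placeholder encoders in place of a real pre-trained vision-language model:

```python
import torch
import torch.nn as nn

# The text side would come from an action-focused prompt (e.g. "a person is
# <verb> <noun>") fed to a pre-trained VLM; both encoders here are stand-ins.

dim = 128
visual_enc = nn.Linear(512, dim)        # stand-in for a video/image encoder
text_enc = nn.EmbeddingBag(1000, dim)   # stand-in for a prompted text encoder

def joint_embedding(frame_feat, prompt_token_ids):
    v = nn.functional.normalize(visual_enc(frame_feat), dim=-1)
    t = nn.functional.normalize(text_enc(prompt_token_ids), dim=-1)
    return v, t, v @ t.T                # cosine-similarity logits

v, t, sim = joint_embedding(torch.randn(4, 512), torch.randint(0, 1000, (4, 8)))
print(sim.shape)  # torch.Size([4, 4])
```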
arXiv Detail & Related papers (2023-09-14T03:25:37Z)
- Event-Guided Procedure Planning from Instructional Videos with Text Supervision [31.82121743586165]
We focus on the task of procedure planning from instructional videos with text supervision.
A critical challenge of this task is the large semantic gap between observed visual states and unobserved intermediate actions.
We propose a novel event-guided paradigm, which first infers events from the observed states and then plans out actions based on both the states and predicted events.
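A minimal sketch of the two-stage paradigm, with assumed linear heads: an event classifier over the observed states, then an action predictor conditioned on the states plus the inferred event:

```python
import torch
import torch.nn as nn

# Stage 1: infer the high-level event from observed start/goal states.
# Stage 2: plan actions conditioned on both the states and the event.

state_dim, num_events, num_actions, horizon = 512, 18, 105, 4

event_head = nn.Linear(2 * state_dim, num_events)
action_head = nn.Linear(2 * state_dim + num_events, horizon * num_actions)

def plan(start, goal):
    states = torch.cat([start, goal], -1)
    event_logits = event_head(states)
    event = event_logits.softmax(-1)         # soft event prediction
    logits = action_head(torch.cat([states, event], -1))
    return event_logits, logits.view(-1, horizon, num_actions)

ev, acts = plan(torch.randn(2, state_dim), torch.randn(2, state_dim))
print(ev.shape, acts.shape)  # torch.Size([2, 18]) torch.Size([2, 4, 105])
```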
arXiv Detail & Related papers (2023-08-17T09:43:28Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework for action segmentation via denoising diffusion models, which share the same inherent spirit of iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
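A minimal sketch of this iterative refinement, assuming a placeholder denoiser over per-frame action logits conditioned on video features:

```python
import torch
import torch.nn as nn

# Start from random per-frame action logits and repeatedly denoise them,
# conditioned on video features. Denoiser and update rule are simplified.

frames, feat_dim, num_classes = 100, 64, 11
denoiser = nn.Linear(feat_dim + num_classes, num_classes)

def segment(video_feats, steps=25):
    y = torch.randn(frames, num_classes)     # noisy per-frame predictions
    for _ in range(steps):
        y = y + 0.2 * (denoiser(torch.cat([video_feats, y], -1)) - y)
    return y.argmax(-1)                      # per-frame action labels

print(segment(torch.randn(frames, feat_dim)).shape)  # torch.Size([100])
```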
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD.
Taking random temporal proposals as input, it can accurately yield action proposals for an untrimmed long video.
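A rough sketch of proposal denoising, assuming (center, width) proposals in [0, 1] and a placeholder refinement network:

```python
import torch
import torch.nn as nn

# Random temporal proposals are iteratively refined, conditioned on pooled
# video features, into action intervals. The refinement network is assumed.

feat_dim, num_proposals = 64, 30
refine = nn.Linear(feat_dim + 2, 2)

def denoise_proposals(video_feat, steps=20):
    props = torch.rand(num_proposals, 2)     # random (center, width) in [0, 1]
    ctx = video_feat.mean(0, keepdim=True).expand(num_proposals, feat_dim)
    for _ in range(steps):
        props = props + 0.1 * refine(torch.cat([ctx, props], -1))
    return props.clamp(0, 1)

print(denoise_proposals(torch.randn(200, feat_dim)).shape)  # torch.Size([30, 2])
```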
arXiv Detail & Related papers (2023-03-27T00:40:52Z)
- PDPP: Projected Diffusion for Procedure Planning in Instructional Videos [30.637651835289635]
We study the problem of procedure planning in instructional videos.
This problem aims to make goal-directed plans given the current visual observations in unstructured real-life videos.
arXiv Detail & Related papers (2023-03-26T10:50:16Z)
- FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment [93.09267863425492]
We argue that understanding both high-level semantics and internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable.
We construct a new fine-grained dataset, called FineDiving, developed on diverse diving events with detailed annotations on action procedures.
arXiv Detail & Related papers (2022-04-07T17:59:32Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
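A minimal sketch of pairwise modality interaction, assuming shared cross-attention applied to each ordered pair of modalities:

```python
import torch
import torch.nn as nn

# For each ordered pair of modalities (e.g. video->text), cross-attention lets
# one modality pull complementary information from the other. Sizes assumed.

dim = 64
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

def interact(modalities):
    """modalities: dict of name -> (B, L, dim) features."""
    fused = {}
    for q_name, q in modalities.items():
        for k_name, k in modalities.items():
            if q_name == k_name:
                continue
            out, _ = attn(q, k, k)           # q attends to the other modality
            fused[(q_name, k_name)] = out
    return fused

feats = {"video": torch.randn(2, 10, dim), "text": torch.randn(2, 5, dim)}
print(sorted(interact(feats)))  # [('text', 'video'), ('video', 'text')]
```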
arXiv Detail & Related papers (2020-07-28T12:40:59Z)