Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos
- URL: http://arxiv.org/abs/2507.03393v1
- Date: Fri, 04 Jul 2025 08:54:59 GMT
- Title: Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos
- Authors: Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang
- Abstract summary: We address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. We propose the Masked Temporal Interpolation Diffusion (MTID) model, which introduces a latent space temporal interpolation module within the diffusion model.
- Score: 32.71627274876863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model's capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at https://github.com/WiserZhou/MTID.
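Two of the abstract's mechanisms are concrete enough to sketch: the learnable interpolation matrix that produces intermediate latent features between the start and end observations, and the proximity loss that weights predictions near the observed boundary states more heavily. Below is a minimal PyTorch sketch of both ideas; the names (LatentInterpolator, masked_proximity_loss), shapes, and weighting scheme are illustrative assumptions, not the authors' implementation (the real code is at https://github.com/WiserZhou/MTID).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentInterpolator(nn.Module):
    """Generates T intermediate latent features from start/end latents
    via a learnable interpolation matrix (sketch of the MTID idea)."""
    def __init__(self, num_steps: int, dim: int):
        super().__init__()
        # Learnable mixing weight per intermediate step, initialized linearly.
        self.mix = nn.Parameter(torch.linspace(0, 1, num_steps).unsqueeze(1))  # (T, 1)
        self.proj = nn.Linear(dim, dim)  # refine the interpolated latents

    def forward(self, z_start, z_end):
        # z_start, z_end: (B, D) -> intermediate latents: (B, T, D)
        w = self.mix.clamp(0, 1)                                  # (T, 1)
        mid = (1 - w) * z_start.unsqueeze(1) + w * z_end.unsqueeze(1)
        return self.proj(mid)

def masked_proximity_loss(pred, target, valid_mask):
    """Weights errors more heavily near the observed start/end states
    (a guess at the 'task-adaptive masked proximity loss')."""
    T = pred.shape[1]
    # Distance of each step from the nearest observed boundary (0 or T-1).
    idx = torch.arange(T, device=pred.device).float()
    dist = torch.minimum(idx, (T - 1) - idx)
    weight = 1.0 / (1.0 + dist)                                   # larger near the ends
    per_step = F.mse_loss(pred, target, reduction="none").mean(-1)  # (B, T)
    per_step = per_step * weight * valid_mask                     # mask task-irrelevant steps
    return per_step.sum() / valid_mask.sum().clamp(min=1)

# Usage sketch with random tensors standing in for visual latents.
interp = LatentInterpolator(num_steps=4, dim=128)
z0, z1 = torch.randn(2, 128), torch.randn(2, 128)
mid_latents = interp(z0, z1)                                      # (2, 4, 128)
loss = masked_proximity_loss(mid_latents, torch.randn(2, 4, 128), torch.ones(2, 4))
```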
Related papers
- FDDet: Frequency-Decoupling for Boundary Refinement in Temporal Action Detection [4.015022008487465]
Large-scale pre-trained video encoders tend to introduce background clutter and irrelevant semantics, leading to context confusion and imprecise boundaries. We propose a frequency-aware decoupling network that improves action discriminability by filtering out noisy semantics captured by pre-trained models. Our method achieves state-of-the-art performance on temporal action detection benchmarks.
arXiv Detail & Related papers (2025-04-01T10:57:37Z)
- ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation [66.8640112000444]
Temporal action segmentation and long-term action anticipation are popular vision tasks for the temporal analysis of actions in videos. We tackle these two problems, action segmentation and action anticipation, jointly using a unified diffusion model dubbed ActFusion. We introduce a new anticipative masking strategy during training, in which a late part of the video frames is masked as invisible and learnable tokens replace these frames to learn to predict the invisible future.
arXiv Detail & Related papers (2024-12-05T17:12:35Z)
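As a rough illustration of the anticipative masking described above, the following sketch replaces a late span of frame features with a learnable token; the class name, the visible_ratio parameter, and all shapes are assumptions rather than ActFusion's actual code.
```python
import torch
import torch.nn as nn

class AnticipativeMasking(nn.Module):
    """Sketch of anticipative masking: during training, a late span of frame
    features is hidden and replaced by a learnable token, so the model must
    learn to predict the unseen future."""
    def __init__(self, dim: int):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, feats, visible_ratio=0.7):
        # feats: (B, T, D); frames after the cutoff are masked as invisible.
        B, T, D = feats.shape
        cutoff = int(T * visible_ratio)
        out = feats.clone()
        out[:, cutoff:] = self.mask_token  # learnable token stands in for the future
        return out, cutoff

# Usage: the unified model then predicts actions for all T steps,
# including the masked (anticipated) tail.
masker = AnticipativeMasking(dim=64)
masked_feats, cutoff = masker(torch.randn(2, 100, 64))
```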
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
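The blurb's "contextualized features as prediction targets" suggests a data2vec-style teacher-student recipe: a teacher encodes the full sequence and a student regresses the teacher's high-level features. The sketch below assumes an EMA teacher over a small Transformer encoder; every name and hyperparameter is a guess, not the paper's configuration.
```python
import copy
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Small Transformer encoder producing contextualized per-frame features."""
    def __init__(self, dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):  # x: (B, T, D) skeleton features
        return self.encoder(x)

student = ContextEncoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher is not trained by gradients

x = torch.randn(2, 50, 64)
with torch.no_grad():
    targets = teacher(x)      # contextualized targets from the full sequence
pred = student(x)             # in practice the student would see a masked view
loss = nn.functional.mse_loss(pred, targets)

# EMA update of the teacher after each optimizer step (momentum assumed).
m = 0.999
for ps, pt in zip(student.parameters(), teacher.parameters()):
    pt.mul_(m).add_(ps.detach(), alpha=1 - m)
```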
- Tapestry of Time and Actions: Modeling Human Activity Sequences using Temporal Point Process Flows [9.571588145356277]
We present ProActive, a framework for modeling the continuous-time distribution of actions in an activity sequence.
ProActive addresses three high-impact problems -- next action prediction, sequence-goal prediction, and end-to-end sequence generation.
arXiv Detail & Related papers (2023-07-13T19:17:54Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework via denoising diffusion models, which shares the inherent spirit of iterative refinement.
In this framework, action predictions are iteratively generated from random noise, with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
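A toy sketch of this conditional iterative refinement: per-frame action scores start as random noise and are repeatedly re-estimated from the video-feature condition. The Denoiser network and the simplistic fixed-step loop stand in for a proper diffusion noise schedule and are purely illustrative.
```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Refines noisy per-frame action scores conditioned on video features."""
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + num_classes, 128), nn.ReLU(),
            nn.Linear(128, num_classes))

    def forward(self, noisy_labels, feats):
        # noisy_labels: (B, T, C); feats: (B, T, F) video-feature condition
        return self.net(torch.cat([noisy_labels, feats], dim=-1))

denoiser = Denoiser()
feats = torch.randn(2, 100, 64)
x = torch.randn(2, 100, 10)        # start from pure noise
for _ in range(25):                # iterative refinement toward action logits
    x = denoiser(x, feats)
segmentation = x.argmax(-1)        # (B, T) per-frame action labels
```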
arXiv Detail & Related papers (2023-03-31T10:53:24Z) - Learning Sequence Representations by Non-local Recurrent Neural Memory [61.65105481899744]
We propose a Non-local Recurrent Neural Memory (NRNM) for supervised sequence representation learning.
Our model captures long-range dependencies and distills latent high-level features.
Our model compares favorably against other state-of-the-art methods specifically designed for each of these sequence applications.
arXiv Detail & Related papers (2022-07-20T07:26:15Z) - ProActive: Self-Attentive Temporal Point Process Flows for Activity
Sequences [9.571588145356277]
ProActive is a framework for modeling the continuous-time distribution of actions in an activity sequence.
It addresses next action prediction, sequence-goal prediction, and end-to-end sequence generation.
arXiv Detail & Related papers (2022-06-10T16:30:55Z)
- AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism [64.70568612993416]
We formulate a new task, Livestream Highlight Detection, analyze its difficulties, and propose a novel architecture, AntPivot, to solve this problem.
We construct a fully-annotated dataset AntHighlight to instantiate this task and evaluate the performance of our model.
arXiv Detail & Related papers (2022-06-10T05:58:11Z)
- ASFormer: Transformer for Action Segmentation [9.509416095106493]
We present an efficient Transformer-based model for the action segmentation task, named ASFormer.
It constrains the hypothesis space within a reliable scope, helping the model learn a proper target function for action segmentation from small training sets.
We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences.
arXiv Detail & Related papers (2021-10-16T13:07:20Z)
- MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
arXiv Detail & Related papers (2020-06-16T14:50:47Z)
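The two-step recipe (an initial prediction followed by stage-wise refinement) can be sketched as a stack of dilated temporal convolution stages, each consuming the previous stage's softmax output. Layer counts and widths below are illustrative, not the paper's configuration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage(nn.Module):
    """One temporal convolution stage with residual dilated convolutions."""
    def __init__(self, in_ch, num_classes, hidden=64):
        super().__init__()
        self.conv_in = nn.Conv1d(in_ch, hidden, 1)
        self.dilated = nn.ModuleList(
            nn.Conv1d(hidden, hidden, 3, padding=2 ** i, dilation=2 ** i)
            for i in range(4))
        self.conv_out = nn.Conv1d(hidden, num_classes, 1)

    def forward(self, x):  # x: (B, C, T)
        h = self.conv_in(x)
        for conv in self.dilated:
            h = h + F.relu(conv(h))   # residual dilated convolutions
        return self.conv_out(h)

class MultiStageTCN(nn.Module):
    """Stage 1 predicts from features; later stages refine prior predictions."""
    def __init__(self, feat_dim=64, num_classes=10, num_stages=3):
        super().__init__()
        self.stage1 = Stage(feat_dim, num_classes)
        self.refine = nn.ModuleList(
            Stage(num_classes, num_classes) for _ in range(num_stages - 1))

    def forward(self, feats):  # feats: (B, F, T)
        out = self.stage1(feats)
        outputs = [out]
        for stage in self.refine:
            out = stage(F.softmax(out, dim=1))  # refine previous prediction
            outputs.append(out)
        return outputs  # a loss is typically applied to every stage's output

outputs = MultiStageTCN()(torch.randn(2, 64, 100))
```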
- Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences [25.299599341774204]
This paper proposes an approach for the unsupervised learning of actions in untrimmed video sequences based on a joint visual-temporal embedding space.
We show that the proposed approach is able to provide a meaningful visual and temporal embedding out of the visual cues present in contiguous video frames.
arXiv Detail & Related papers (2020-01-29T22:51:06Z)