Masked Diffusion with Task-awareness for Procedure Planning in
Instructional Videos
- URL: http://arxiv.org/abs/2309.07409v1
- Date: Thu, 14 Sep 2023 03:25:37 GMT
- Title: Masked Diffusion with Task-awareness for Procedure Planning in
Instructional Videos
- Authors: Fen Fang, Yun Liu, Ali Koksal, Qianli Xu, Joo-Hwee Lim
- Abstract summary: A key challenge with procedure planning in instructional videos is how to handle a large decision space consisting of a multitude of action types.
We introduce a simple yet effective enhancement - a masked diffusion model.
We learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions.
- Score: 16.93979476655776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key challenge with procedure planning in instructional videos lies in how
to handle a large decision space consisting of a multitude of action types that
belong to various tasks. To understand real-world video content, an AI agent
must proficiently discern these action types (e.g., pour milk, pour water, open
lid, close lid, etc.) based on brief visual observation. Moreover, it must
adeptly capture the intricate semantic relation of the action types and task
goals, along with the variable action sequences. Recently, notable progress has
been made via the integration of diffusion models and visual representation
learning to address the challenge. However, existing models employ rudimentary
mechanisms to utilize task information to manage the decision space. To
overcome this limitation, we introduce a simple yet effective enhancement - a
masked diffusion model. The introduced mask acts akin to a task-oriented
attention filter, enabling the diffusion/denoising process to concentrate on a
subset of action types. Furthermore, to bolster the accuracy of task
classification, we harness more potent visual representation learning
techniques. In particular, we learn a joint visual-text embedding, where a text
embedding is generated by prompting a pre-trained vision-language model to
focus on human actions. We evaluate the method on three public datasets and
achieve state-of-the-art performance on multiple metrics. Code is available at
https://github.com/ffzzy840304/Masked-PDPP.
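As a rough illustration of the masking idea described in the abstract, the sketch below (an assumption-laden toy example, not code from the linked repository) builds a task-conditioned binary mask over the action-type dimension and uses it to suppress out-of-task action logits in a denoising step. The names `build_task_mask`, `masked_denoise_step`, and `task_to_actions` are hypothetical, not identifiers from Masked-PDPP.

```python
# Minimal sketch (assumptions, not the authors' released code): a task-conditioned
# mask restricts the denoising process to the action types of the predicted task.
import torch

def build_task_mask(task_ids, task_to_actions, num_action_types):
    """Binary mask of shape (batch, num_action_types): 1 for action types
    associated with each sample's predicted task, 0 otherwise."""
    mask = torch.zeros(len(task_ids), num_action_types)
    for i, t in enumerate(task_ids):
        mask[i, task_to_actions[int(t)]] = 1.0
    return mask

def masked_denoise_step(action_logits, task_mask, neg_value=-1e4):
    """Acts like a task-oriented attention filter: logits of out-of-task action
    types are pushed to a large negative value, so the subsequent softmax
    concentrates probability mass on the in-task subset."""
    return action_logits * task_mask + neg_value * (1.0 - task_mask)

# Toy usage: 2 tasks, 6 action types, a planning horizon of 3 steps.
task_to_actions = {0: [0, 1, 2], 1: [3, 4, 5]}
task_ids = torch.tensor([0, 1])                      # predicted task per sample
logits = torch.randn(2, 3, 6)                        # (batch, horizon, action types)
mask = build_task_mask(task_ids, task_to_actions, 6).unsqueeze(1)  # broadcast over horizon
probs = masked_denoise_step(logits, mask).softmax(dim=-1)
```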
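For the joint visual-text embedding, the abstract describes prompting a pre-trained vision-language model so that the text embedding focuses on human actions. The sketch below assumes an off-the-shelf CLIP model as the vision-language model; the prompt template and the fusion projection are illustrative choices, not details taken from the paper.

```python
# Illustrative sketch only: obtain action-focused text embeddings from a
# pre-trained vision-language model (CLIP assumed here) and fuse them with
# visual features to form a joint visual-text embedding.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

action_names = ["pour milk", "pour water", "open lid", "close lid"]
prompts = [f"a person performing the action of {a}" for a in action_names]
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    text_emb = model.encode_text(tokens).float()          # (num_actions, 512)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Placeholder visual features for the observed start/goal frames; in practice
# these would come from the video observations.
visual_emb = torch.randn(2, 512, device=device)
fused = torch.cat([visual_emb,
                   text_emb.mean(dim=0, keepdim=True).expand(2, -1)], dim=-1)
proj = torch.nn.Linear(1024, 512).to(device)              # learned fusion layer
joint_emb = proj(fused)                                    # joint visual-text embedding
```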
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z)
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new SOTA performance but also possesses excellent interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- InstructDiffusion: A Generalist Modeling Interface for Vision Tasks [52.981128371910266]
We present InstructDiffusion, a framework for aligning computer vision tasks with human instructions.
InstructDiffusion can handle a variety of vision tasks, including understanding tasks and generative tasks.
It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets.
arXiv Detail & Related papers (2023-09-07T17:56:57Z)
- Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks [64.67735676127208]
Text-to-image diffusion models have shown great potential for benefiting image recognition.
Despite this promise, unsupervised learning on diffusion-generated images remains under-explored.
We introduce customized solutions that fully exploit such free attention masks.
arXiv Detail & Related papers (2023-08-13T10:07:46Z)
- SeMAIL: Eliminating Distractors in Visual Imitation via Separated Models [22.472167814814448]
We propose a new model-based imitation learning algorithm named Separated Model-based Adversarial Imitation Learning (SeMAIL).
Our method achieves near-expert performance on various visual control tasks with complex observations, as well as on more challenging tasks whose backgrounds differ from the expert observations.
arXiv Detail & Related papers (2023-06-19T04:33:44Z)
- ActAR: Actor-Driven Pose Embeddings for Video Action Recognition [12.043574473965318]
Human action recognition (HAR) in videos is one of the core tasks of video understanding.
We propose a new method that learns to efficiently recognize human actions in the infrared spectrum.
arXiv Detail & Related papers (2022-04-19T05:12:24Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.