PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
- URL: http://arxiv.org/abs/2303.14676v2
- Date: Sun, 23 Jul 2023 09:41:51 GMT
- Title: PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
- Authors: Hanlin Wang, Yilu Wu, Sheng Guo, Limin Wang
- Abstract summary: We study the problem of procedure planning in instructional videos.
This problem aims to make goal-directed plans given the current visual observations in unstructured real-life videos.
- Score: 30.637651835289635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of procedure planning in instructional
videos, which aims to make goal-directed plans given the current visual
observations in unstructured real-life videos. Previous works cast this problem
as a sequence planning problem and leverage either heavy intermediate visual
observations or natural language instructions as supervision, resulting in
complex learning schemes and expensive annotation costs. In contrast, we treat
this problem as a distribution fitting problem. In this sense, we model the
whole intermediate action sequence distribution with a diffusion model (PDPP),
and thus transform the planning problem to a sampling process from this
distribution. In addition, we remove the expensive intermediate supervision,
and simply use task labels from instructional videos as supervision instead.
Our model is a U-Net based diffusion model, which directly samples action
sequences from the learned distribution with the given start and end
observations. Furthermore, we apply an efficient projection method to provide
accurate conditional guides for our model during the learning and sampling
process. Experiments on three datasets of different scales show that our PDPP
model achieves state-of-the-art performance on multiple metrics, even
without task supervision. Code and trained models are available at
https://github.com/MCG-NJU/PDPP.
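For intuition, the sketch below shows what such projected sampling can look like: a plain epsilon-prediction DDPM reverse loop over a (horizon x [task | observation | action]) plan matrix, with the condition dimensions re-projected to their ground-truth values after every denoising step. The layout, the helper names, and the unet(plan, timestep) signature are assumptions for illustration, not the authors' API; the actual implementation (including its exact parameterization and training objective) lives in the repository above.

```python
import torch

def project_conditions(x, o_start, o_goal, task_onehot, task_dim, obs_dim):
    """Overwrite the condition dimensions of the noisy plan with the
    ground-truth guides. Layout assumption: each of the T rows is
    [task one-hot | observation features | action logits]."""
    x = x.clone()
    x[:, :, :task_dim] = task_onehot[:, None, :]     # task label on every step
    x[:, :, task_dim:task_dim + obs_dim] = 0.0       # no intermediate observations
    x[:, 0, task_dim:task_dim + obs_dim] = o_start   # start observation, first step
    x[:, -1, task_dim:task_dim + obs_dim] = o_goal   # goal observation, last step
    return x

@torch.no_grad()
def sample_plan(unet, o_start, o_goal, task_onehot,
                horizon, task_dim, obs_dim, action_dim, n_steps=200):
    """Reverse DDPM loop: denoise a Gaussian plan while re-projecting
    the conditions after every step, then read off the action indices."""
    b = o_start.shape[0]
    betas = torch.linspace(1e-4, 0.02, n_steps)      # illustrative linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(b, horizon, task_dim + obs_dim + action_dim)
    x = project_conditions(x, o_start, o_goal, task_onehot, task_dim, obs_dim)
    for t in reversed(range(n_steps)):
        eps = unet(x, torch.full((b,), t))           # assumed signature: (plan, timestep)
        mean = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
        x = project_conditions(x, o_start, o_goal, task_onehot, task_dim, obs_dim)
    return x[:, :, task_dim + obs_dim:].argmax(dim=-1)  # per-step action indices
```

The projection is what keeps sampling on the conditional manifold: the task label, start observation, and goal observation are re-imposed at each step, so the model only has to denoise the action dimensions.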
Related papers
- Pattern-Aware Chain-of-Thought Prompting in Large Language Models [26.641713417293538]
Chain-of-thought (CoT) prompting can guide language models to engage in complex multi-step reasoning.
We show that the underlying reasoning patterns play a more crucial role in such tasks.
We propose Pattern-Aware CoT, a prompting method that considers the diversity of demonstration patterns.
arXiv Detail & Related papers (2024-04-23T07:50:00Z)
- ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos [10.180115984765582]
ActionDiffusion is a novel diffusion model for procedure planning in instructional videos.
Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process.
arXiv Detail & Related papers (2024-03-13T14:54:04Z)
- CI w/o TN: Context Injection without Task Name for Procedure Planning [4.004155037293416]
Procedure planning in instructional videos involves creating goal-directed plans based on visual start and goal observations from videos.
Previous research has tackled this problem with gradually weaker training supervision, from heavy intermediate visual observations or language instructions to task class supervision.
We propose a much weaker setting, without the task name as supervision, which is not currently solvable by existing large language models.
arXiv Detail & Related papers (2024-02-23T19:34:47Z)
- Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization [87.21285093582446]
Diffusion Generative Flow Samplers (DGFS) is a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments.
Our method takes inspiration from the theory developed for generative flow networks (GFlowNets).
arXiv Detail & Related papers (2023-10-04T09:39:05Z)
- Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z)
- Ensemble Modeling for Multimodal Visual Action Recognition [50.38638300332429]
We propose an ensemble modeling approach for multimodal action recognition.
We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset (a minimal focal-loss sketch follows this list).
arXiv Detail & Related papers (2023-08-10T08:43:20Z)
- P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision [31.73732506824829]
We study the problem of procedure planning in instructional videos.
Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state.
We propose a weakly supervised approach by learning from natural language instructions.
arXiv Detail & Related papers (2022-05-04T19:37:32Z)
- Dense Unsupervised Learning for Video Segmentation [49.46930315961636]
We present a novel approach to unsupervised learning for video object segmentation (VOS).
Unlike previous work, our formulation allows learning dense feature representations directly in a fully convolutional regime.
Our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
arXiv Detail & Related papers (2021-11-11T15:15:11Z)
- Evaluating model-based planning and planner amortization for continuous control [79.49319308600228]
We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning.
We find that well-tuned model-free agents are strong baselines even for high DoF control problems.
We show that it is possible to distil a model-based planner into a policy that amortizes the planning without any loss of performance.
arXiv Detail & Related papers (2021-10-07T12:00:40Z)
- Paired Examples as Indirect Supervision in Latent Decision Models [109.76417071249945]
We introduce a way to leverage paired examples that provide stronger cues for learning latent decisions.
We apply our method to improve compositional question answering using neural module networks on the DROP dataset.
arXiv Detail & Related papers (2021-04-05T03:58:30Z)
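As referenced in the Ensemble Modeling entry above, here is a minimal sketch of the standard multi-class focal loss (Lin et al., 2017) that such a variant builds on; the summary does not specify the exact variant used for MECCANO, so treat this as the vanilla form with optional per-class weighting.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, class_weights=None):
    """Multi-class focal loss: down-weights well-classified examples so
    training focuses on the hard (often tail-class) ones.

    logits: (N, C) raw scores; targets: (N,) integer class labels.
    """
    log_p = F.log_softmax(logits, dim=-1)                      # (N, C)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log prob of true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt                     # (1 - p_t)^gamma modulator
    if class_weights is not None:                              # optional alpha_t weighting, shape (C,)
        loss = class_weights[targets] * loss
    return loss.mean()
```

With gamma set to 0 and no class weights this reduces to ordinary cross-entropy, which makes for a quick sanity check when wiring it into a training loop.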