PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
- URL: http://arxiv.org/abs/2303.14676v3
- Date: Wed, 22 Jan 2025 09:50:01 GMT
- Title: PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
- Authors: Hanlin Wang, Yilu Wu, Sheng Guo, Limin Wang
- Abstract summary: We study the problem of procedure planning in instructional videos, which aims to make a plan (i.e. a sequence of actions) given the current visual observation and the desired goal.
Previous works cast this as a sequence modeling problem and leverage either intermediate visual observations or language instructions as supervision.
To avoid intermediate supervision annotation and error accumulation caused by planning autoregressively, we propose a diffusion-based framework.
- Abstract: In this paper, we study the problem of procedure planning in instructional videos, which aims to produce a plan (i.e., a sequence of actions) given the current visual observation and the desired goal. Previous works cast this as a sequence modeling problem and leverage either intermediate visual observations or language instructions as supervision for autoregressive planning, resulting in complex learning schemes and expensive annotation costs. To avoid intermediate supervision annotation and the error accumulation caused by autoregressive planning, we propose a diffusion-based framework, coined PDPP, that directly models the distribution of the whole action sequence with only the task label as supervision. Our core idea is to treat procedure planning as a distribution-fitting problem under the given observations, thus transforming planning into a sampling process from this distribution at inference time. The diffusion-based modeling also effectively addresses the uncertainty inherent in procedure planning. Building on PDPP, we further apply joint training so that a single model generates plans of varying horizon lengths, reducing the number of training parameters required. We instantiate PDPP with three popular diffusion models and investigate a series of condition-introducing methods within our framework, including condition embeddings, MoEs, two-stage prediction, and a Classifier-Free Guidance strategy. Finally, we apply PDPP to the Visual Planners for human Assistance problem, which requires the goal to be specified in natural language rather than as a visual observation. We conduct experiments on challenging datasets of different scales, and our PDPP model achieves state-of-the-art performance on multiple metrics, even compared with strongly supervised counterparts. These results further demonstrate the effectiveness and generalization ability of our model.
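To make the core idea concrete, below is a minimal, hedged sketch in PyTorch of a "projected" diffusion sampler over a whole action plan, in the spirit of the abstract: the plan is represented as one matrix over the planning horizon, the known conditions (here, start/goal observation features) occupy fixed channels, and after every denoising step those channels are overwritten with their given values. All dimensions, the MLP denoiser, and the DDIM-style update rule are illustrative assumptions, not the paper's actual architecture; training would follow the usual diffusion recipe (add noise to ground-truth plans and regress it, with the condition channels kept fixed), and only the sampling side is sketched here.

```python
# Hedged sketch: diffusion over a whole action-plan matrix, re-imposing
# ("projecting") the known conditions after every denoising step.
import torch
import torch.nn as nn

T_HORIZON, A_DIM, C_DIM, N_STEPS = 4, 105, 64, 200   # plan length, action classes, condition dim, diffusion steps

betas = torch.linspace(1e-4, 2e-2, N_STEPS)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Predicts the noise added to the (action || condition) plan matrix."""
    def __init__(self):
        super().__init__()
        d = T_HORIZON * (A_DIM + C_DIM)
        self.net = nn.Sequential(nn.Linear(d + 1, 512), nn.GELU(), nn.Linear(512, d))

    def forward(self, x, t):
        b = x.shape[0]
        inp = torch.cat([x.reshape(b, -1), t.float().view(b, 1) / N_STEPS], dim=-1)
        return self.net(inp).reshape(b, T_HORIZON, A_DIM + C_DIM)

def project(x, cond):
    """Overwrite the condition channels with their known values (the 'projection')."""
    x = x.clone()
    x[:, :, A_DIM:] = cond            # conditions (e.g. start/goal observation features) are given, never generated
    return x

@torch.no_grad()
def sample_plan(model, cond):
    """Reverse diffusion: start from noise, denoise, and re-project the conditions each step."""
    b = cond.shape[0]
    x = project(torch.randn(b, T_HORIZON, A_DIM + C_DIM), cond)
    for t in reversed(range(N_STEPS)):
        eps = model(x, torch.full((b,), t))
        a_bar = alphas_bar[t]
        x0_hat = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()              # predicted clean plan
        if t > 0:
            a_bar_prev = alphas_bar[t - 1]
            x = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * eps  # deterministic DDIM-style step
        else:
            x = x0_hat
        x = project(x, cond)
    return x[:, :, :A_DIM].argmax(dim=-1)   # discrete action sequence (the plan)

model = Denoiser()
cond = torch.randn(2, T_HORIZON, C_DIM)      # placeholder start/goal features broadcast over the horizon
print(sample_plan(model, cond).shape)        # -> torch.Size([2, 4])
```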
Related papers
- Hindsight Planner: A Closed-Loop Few-Shot Planner for Embodied Instruction Following [62.10809033451526]
This work focuses on building a task planner for Embodied Instruction Following (EIF) using Large Language Models (LLMs).
We frame the task as a Partially Observable Markov Decision Process (POMDP) and aim to develop a robust planner under a few-shot assumption.
Our experiments on the ALFRED dataset indicate that our planner achieves competitive performance under a few-shot assumption.
arXiv Detail & Related papers (2024-12-27T10:05:45Z)
- ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos [10.180115984765582]
ActionDiffusion is a novel diffusion model for procedure planning in instructional videos.
Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process.
arXiv Detail & Related papers (2024-03-13T14:54:04Z)
- Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty [56.30846158280031]
Task planning for embodied AI has been one of the most challenging problems.
We propose a task-agnostic method named 'planning as in-painting'.
The proposed framework achieves promising performance in various embodied AI tasks.
arXiv Detail & Related papers (2023-12-02T10:07:17Z)
- Refining Diffusion Planner for Reliable Behavior Synthesis by Automatic Detection of Infeasible Plans [25.326624139426514]
Diffusion-based planning has shown promising results in long-horizon, sparse-reward tasks.
However, due to their nature as generative models, diffusion models are not guaranteed to generate feasible plans.
We propose a novel approach to refine unreliable plans generated by diffusion models by providing refining guidance to error-prone plans.
arXiv Detail & Related papers (2023-10-30T10:35:42Z)
- Compositional Foundation Models for Hierarchical Planning [52.18904315515153]
We propose a foundation model that leverages multiple expert foundation models, trained individually on language, vision, and action data, to jointly solve long-horizon tasks.
We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model.
Generated video plans are then grounded in visuomotor control through an inverse dynamics model that infers actions from the generated videos.
arXiv Detail & Related papers (2023-09-15T17:44:05Z)
- Ensemble Modeling for Multimodal Visual Action Recognition [50.38638300332429]
We propose an ensemble modeling approach for multimodal action recognition.
We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset.
arXiv Detail & Related papers (2023-08-10T08:43:20Z)
- Position Paper: Online Modeling for Offline Planning [2.8326418377665346]
A key part of AI planning research is the representation of action models.
Despite the maturity of the field, AI planning technology is still rarely used outside the research community.
We argue that this is because the modeling process is assumed to have taken place and been completed before planning begins.
arXiv Detail & Related papers (2022-06-07T14:48:08Z)
- P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision [31.73732506824829]
We study the problem of procedure planning in instructional videos.
Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state.
We propose a weakly supervised approach by learning from natural language instructions.
arXiv Detail & Related papers (2022-05-04T19:37:32Z)
- Visual Learning-based Planning for Continuous High-Dimensional POMDPs [81.16442127503517]
Visual Tree Search (VTS) is a learning and planning procedure that combines generative models learned offline with online model-based POMDP planning.
VTS bridges offline model training and online planning by utilizing a set of deep generative observation models to predict and evaluate the likelihood of image observations in a Monte Carlo tree search planner.
We show that VTS is robust to different observation noises and, since it utilizes online, model-based planning, can adapt to different reward structures without the need to re-train.
arXiv Detail & Related papers (2021-12-17T11:53:31Z)
- Evaluating model-based planning and planner amortization for continuous control [79.49319308600228]
We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning.
We find that well-tuned model-free agents are strong baselines even for high DoF control problems.
We show that it is possible to distil a model-based planner into a policy that amortizes the planning without any loss of performance.
arXiv Detail & Related papers (2021-10-07T12:00:40Z)