Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
- URL: http://arxiv.org/abs/2507.15130v1
- Date: Sun, 20 Jul 2025 21:39:05 GMT
- Title: Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
- Authors: Ce Zhang, Yale Song, Ruta Desai, Michael Louis Iuzzolino, Joseph Tighe, Gedas Bertasius, Satwik Kottur,
- Abstract summary: Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user's progress.<n>Recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding.<n>We identify two challenges in training large MLLMs for video-based planning tasks.
- Score: 41.63965006043724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user's progress. Although recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding, long-horizon visual planning remains a challenging problem. We identify two challenges in training large MLLMs for video-based planning tasks: (1) scarcity of procedural annotations, limiting the model's ability to learn procedural task dynamics effectively, and (2) inefficiency of next-token prediction objective to explicitly capture the structured action space for visual planning when compared to free-form, natural language. To tackle data scarcity, we introduce Auxiliary Task Augmentation. We design and train our model on auxiliary tasks relevant to long-horizon video-based planning (e.g., goal prediction) to augment the model's planning ability. To more explicitly model the structured action space unique to visual planning tasks, we leverage Multi-token Prediction, extending traditional next-token prediction by using multiple heads to predict multiple future tokens during training. Our approach, VideoPlan, achieves state-of-the-art VPA performance on the COIN and CrossTask datasets, surpassing prior methods by 7.3% and 3.4%, respectively, when predicting 3 future actions. We further extend our method to the challenging Ego4D Long-term Action Anticipation task, and show that it is on par with the state-of-the-art approaches despite not using specialized egocentric features. Code will be made available.
Related papers
- Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos [48.15438373870542]
VidAssist is an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos.
It employs a breadth-first search algorithm for optimal plan generation.
Experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups.
arXiv Detail & Related papers (2024-09-30T17:57:28Z) - Show and Guide: Instructional-Plan Grounded Vision and Language Model [9.84151565227816]
We present MM-PlanLLM, the first multimodal plan-following language model.
We bring cross-modality through two key tasks: Conversational Video Moment Retrieval and Visually-Informed Step Generation.
MM-PlanLLM is trained using a novel multitask-multistage approach.
arXiv Detail & Related papers (2024-09-27T18:20:24Z) - SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
Segment Anything (SAM) is a vision-foundation model for generalizable scene understanding and sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z) - Video Language Planning [137.06052217713054]
Video language planning is an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models.
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
arXiv Detail & Related papers (2023-10-16T17:48:45Z) - Temporal DINO: A Self-supervised Video Strategy to Enhance Action
Prediction [15.696593695918844]
This paper introduces a novel self-supervised video strategy for enhancing action prediction inspired by DINO (self-distillation with no labels)
The experimental results showcase significant improvements in prediction performance across 3D-ResNet, Transformer, and LSTM architectures.
These findings highlight the potential of our approach in diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
arXiv Detail & Related papers (2023-08-08T21:18:23Z) - EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [95.37585041654535]
Embodied AI is capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments.
In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI.
Experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering.
arXiv Detail & Related papers (2023-05-24T11:04:30Z) - Pretrained Language Models as Visual Planners for Human Assistance [12.8775186900555]
Visual Planning for Assistance (VPA) is a tool for guiding users to achieve complex multi-step goals.
We decompose VPA into video action segmentation and forecasting.
This novel approach, which we call Visual Language Model based Planner (VLaMP), outperforms baselines across a suite of metrics.
arXiv Detail & Related papers (2023-04-17T18:07:36Z) - Long-Horizon Visual Planning with Goal-Conditioned Hierarchical
Predictors [124.30562402952319]
The ability to predict and plan into the future is fundamental for agents acting in the world.
Current learning approaches for visual prediction and planning fail on long-horizon tasks.
We propose a framework for visual prediction and planning that is able to overcome both of these limitations.
arXiv Detail & Related papers (2020-06-23T17:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.