VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
- URL: http://arxiv.org/abs/2602.19146v1
- Date: Sun, 22 Feb 2026 12:20:28 GMT
- Title: VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
- Authors: Diogo Glória-Silva, David Semedo, João Magalhães
- Abstract summary: We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
- Score: 2.836258000910872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were conducted on a novel dataset of rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
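As a rough illustration of what plan-based retrieval over textual plan steps could look like, the sketch below scores a user query against each step of a plan in a shared embedding space (here CLIP's text encoder). The plan, model choice, and function are illustrative assumptions, not VIGiA's actual implementation.

```python
# Illustrative sketch of plan-step retrieval via embedding similarity.
# NOT VIGiA's implementation: the plan and the CLIP model are assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

plan_steps = [
    "Preheat the oven to 180C.",
    "Whisk the eggs with the sugar until pale.",
    "Fold in the flour and pour the batter into the tin.",
]

def retrieve_step(query_text):
    """Return the plan step whose embedding is closest to the query."""
    inputs = processor(text=[query_text] + plan_steps,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalize for cosine similarity
    sims = emb[0] @ emb[1:].T                     # query vs. each plan step
    return plan_steps[int(sims.argmax())]

print(retrieve_step("What do I do after beating the eggs?"))
```

Visual plan-step retrieval could follow the same pattern by swapping the step texts for image embeddings from the same model.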
Related papers
- Visual Planning: Let's Think Only with Images [46.501897093584965]
We argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions.
arXiv Detail & Related papers (2025-05-16T16:17:22Z)
- Show and Guide: Instructional-Plan Grounded Vision and Language Model [9.84151565227816]
We present MM-PlanLLM, the first multimodal plan-following language model.
We bring cross-modality through two key tasks: Conversational Video Moment Retrieval and Visually-Informed Step Generation.
MM-PlanLLM is trained using a novel multitask-multistage approach.
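The Conversational Video Moment Retrieval task mentioned above can be pictured as scoring the dialogue context against per-segment video embeddings and returning the best-matching span; the sketch below is a schematic assumption, not MM-PlanLLM's architecture.

```python
# Schematic moment retrieval: rank video segments by cosine similarity
# to the dialogue-context embedding. Encoders are assumed and stubbed out.
import numpy as np

def retrieve_moment(dialogue_emb, segment_embs, segment_times):
    """Return the (start, end) of the segment most similar to the dialogue context."""
    dialogue_emb = dialogue_emb / np.linalg.norm(dialogue_emb)
    segment_embs = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    scores = segment_embs @ dialogue_emb          # cosine similarities per segment
    return segment_times[int(scores.argmax())]

# Toy usage with random vectors standing in for real video/dialogue encoders.
rng = np.random.default_rng(0)
times = [(0.0, 12.0), (12.0, 30.0), (30.0, 55.0)]
print(retrieve_moment(rng.normal(size=256), rng.normal(size=(3, 256)), times))
```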
arXiv Detail & Related papers (2024-09-27T18:20:24Z)
- Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey [59.95153883166705]
Traditional computer vision generally solves each single task independently by a dedicated model with the task instruction implicitly designed in the model architecture.
Visual Instruction Tuning (VIT) has been intensively studied recently, which finetunes a large vision model with language as task instructions.
This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; and (3) the commonly used datasets in visual instruction tuning and evaluation.
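As a rough illustration of what a visual instruction tuning sample can look like, the toy record below pairs an image with a language instruction and a target response; the field names are assumptions, since schemas vary across datasets.

```python
# Toy visual instruction tuning record (illustrative schema only):
# an image, a natural-language task instruction, and the expected response.
record = {
    "image": "kitchen_0042.jpg",
    "instruction": "Describe what the person is doing in this image.",
    "response": "A person is slicing an onion on a wooden cutting board.",
}
print(record["instruction"])
```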
arXiv Detail & Related papers (2023-12-27T14:54:37Z)
- EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
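A toy illustration of casting actions as text-generation targets, in the spirit of a unified "language of actions"; the serialization format is an assumption for illustration, not EMMA's actual tokenization.

```python
# Serialize a structured action into a flat token string so that a text
# generator can emit it. Tag names are illustrative, not EMMA's.
def action_to_text(action):
    return f"<act> {action['verb']} <obj> {action['object']} <loc> {action['location']}"

print(action_to_text({"verb": "pick up", "object": "mug", "location": "kitchen counter"}))
# -> "<act> pick up <obj> mug <loc> kitchen counter"
```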
arXiv Detail & Related papers (2023-11-07T15:27:52Z)
- Video Language Planning [137.06052217713054]
Video language planning is an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models.
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
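The tree-search procedure described above can be sketched as a beam search in which a policy proposes candidate steps, a video dynamics model imagines their outcomes, and a value function ranks them. The sketch below uses placeholder callables and is not the paper's implementation.

```python
# Schematic video-language planning loop with placeholder models:
# policy(goal, frame) -> candidate next steps (strings)
# dynamics(frame, step) -> imagined next frame
# value(goal, frame) -> scalar score
def plan(goal, start_frame, policy, dynamics, value, depth=3, beam=2):
    """Greedy beam search over (score, frame, step-sequence) states."""
    frontier = [(0.0, start_frame, [])]
    for _ in range(depth):
        candidates = []
        for _, frame, steps in frontier:
            for step in policy(goal, frame):          # VLM-as-policy: propose next steps
                next_frame = dynamics(frame, step)    # text-to-video model: imagine outcome
                candidates.append((value(goal, next_frame), next_frame, steps + [step]))
        frontier = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam]
    return frontier[0][2]                             # best step sequence found
```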
arXiv Detail & Related papers (2023-10-16T17:48:45Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
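One common way to realize such a sequence formulation is to project video features into the language model's embedding space and prepend them to the dialogue tokens, so a single pretrained decoder attends over both modalities. The sketch below (GPT-2, random video features, assumed dimensions) illustrates the idea rather than the paper's exact setup.

```python
# Illustrative multimodal input sequence for a pretrained decoder.
# The projection layer and feature dimensions are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
project = nn.Linear(2048, lm.config.n_embd)          # video feature -> LM embedding size

video_feats = torch.randn(1, 8, 2048)                # 8 placeholder video segment features
dialogue = "User: What is he chopping? Assistant:"
tok = tokenizer(dialogue, return_tensors="pt")

text_emb = lm.transformer.wte(tok.input_ids)         # token embeddings
inputs_embeds = torch.cat([project(video_feats), text_emb], dim=1)
out = lm(inputs_embeds=inputs_embeds)
print(out.logits.shape)                              # (1, 8 + num_tokens, vocab_size)
```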
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
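Teacher-recommended learning of this kind can be approximated as soft-target distillation from the external language model; the loss below, with illustrative weighting and temperature, sketches that idea rather than the paper's exact objective.

```python
# Sketch of distilling an external language model's soft word distribution
# into a caption model, alongside standard cross-entropy on the ground truth.
# alpha and temperature are illustrative, not the paper's values.
import torch
import torch.nn.functional as F

def trl_loss(student_logits, gt_token, teacher_logits, alpha=0.5, temperature=2.0):
    """Cross-entropy on the ground-truth word plus KL to the teacher's soft targets."""
    ce = F.cross_entropy(student_logits, gt_token)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kl

# Toy usage: 4 caption positions over a 10k-word vocabulary.
student = torch.randn(4, 10000, requires_grad=True)
teacher = torch.randn(4, 10000)
target = torch.randint(0, 10000, (4,))
print(trl_loss(student, target, teacher))
```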
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences of its use.