Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark
- URL: http://arxiv.org/abs/2508.11192v1
- Date: Fri, 15 Aug 2025 03:57:20 GMT
- Title: Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark
- Authors: Lavisha Aggarwal, Vikas Bahirwani, Lin Li, Andrea Colaco
- Abstract summary: We propose a simple yet effective approach that transforms single-person instructional videos into task-guidance two-person dialogues. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. We build HowToDIV, a large-scale dataset containing 507 conversations, 6636 question-answer pairs and 24 hours of video clips across diverse tasks in cooking, mechanics, and planting.
- Score: 4.583536383592244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many everyday tasks, ranging from fixing appliances and cooking recipes to car maintenance, require expert knowledge, especially when the tasks are complex and multi-step. Despite growing interest in AI agents, there is a scarcity of dialogue-video datasets grounded in real-world task assistance. In this paper, we propose a simple yet effective approach that transforms single-person instructional videos into task-guidance two-person dialogues, aligned with fine-grained steps and video clips. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. Using this technique, we build HowToDIV, a large-scale dataset containing 507 conversations, 6636 question-answer pairs, and 24 hours of video clips across diverse tasks in cooking, mechanics, and planting. Each session includes a multi-turn conversation in which an expert teaches a novice user how to perform a task step by step, while observing the user's surroundings through a camera- and microphone-equipped wearable device. We establish baseline benchmark performance on the HowToDIV dataset using the Gemma-3 model for future research on this new task of dialogues for procedural-task assistance.
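The abstract does not spell out the generation pipeline, but the core idea of prompting an LLM to rewrite step-aligned video narration as an expert-novice exchange can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the class names (`VideoStep`, `DialogueTurn`), the `build_dialogue_prompt` helper, and the toy appliance-repair example are all hypothetical, and the paper's actual prompts, step segmentation, and model calls may differ.

```python
# Hypothetical sketch of turning a single-person instructional video's
# step-aligned narration into an expert-novice dialogue prompt for an LLM.
# Not the paper's implementation; names and prompt wording are assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class VideoStep:
    """One fine-grained step from an instructional video."""
    start_sec: float   # clip start time in the source video
    end_sec: float     # clip end time
    narration: str     # narrator's transcript for this step

@dataclass
class DialogueTurn:
    """A single turn in the generated two-person conversation."""
    speaker: str       # "novice" or "expert"
    text: str
    step_index: int    # which VideoStep this turn is aligned with

def build_dialogue_prompt(task_name: str, steps: List[VideoStep]) -> str:
    """Assemble an LLM prompt asking for an expert-novice dialogue
    grounded in the step-level narration of the video."""
    lines = [
        f"Task: {task_name}",
        "Rewrite the following instructional narration as a conversation",
        "between a novice performing the task and an expert guiding them.",
        "Each expert answer must stay faithful to the corresponding step.",
        "",
    ]
    for i, step in enumerate(steps):
        lines.append(
            f"Step {i} ({step.start_sec:.0f}-{step.end_sec:.0f}s): {step.narration}"
        )
    return "\n".join(lines)

# Example usage with toy data; the resulting prompt would be sent to an LLM,
# whose output would then be parsed into DialogueTurn objects.
steps = [
    VideoStep(0, 12, "First, unplug the appliance and remove the back panel."),
    VideoStep(12, 30, "Locate the heating element and check it for visible damage."),
]
print(build_dialogue_prompt("Fix a toaster", steps))
```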
Related papers
- From Videos to Conversations: Egocentric Instructions for Task Assistance [2.848400947017194]
We present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches.
arXiv Detail & Related papers (2026-02-01T05:53:41Z) - IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval [36.33423199468626]
The Interactive Video Corpus Retrieval (IVCR) task enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. IVCR-200K is a high-quality, bilingual, multi-turn, conversational, and abstract-semantic dataset that supports video retrieval and even moment retrieval. We propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions.
arXiv Detail & Related papers (2025-12-01T06:12:59Z) - Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. Under this setting, we accomplish InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data. We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z) - Proactive Assistant Dialogue Generation from Streaming Egocentric Videos [48.30863954384779]
This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses.
arXiv Detail & Related papers (2025-06-06T09:23:29Z) - Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models [49.4824734958566]
Chain-of-Modality (CoM) enables Vision Language Models to reason about multimodal human demonstration data. CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt.
arXiv Detail & Related papers (2025-04-17T21:31:23Z) - InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models [11.913271486031201]
We develop a Context-aware instructional task assistant with multi-modal large language models (InsTALL). InsTALL responds in real-time to user queries related to the task at hand. We show InsTALL achieves state-of-the-art performance across proposed sub-tasks considered for multimodal activity understanding.
arXiv Detail & Related papers (2025-01-21T15:55:06Z) - HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World [48.90399899928823]
This work is part of a broader research effort to develop intelligent agents that can interactively guide humans through performing tasks in the physical world.
We introduce HoloAssist, a large-scale egocentric human interaction dataset.
We present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment.
arXiv Detail & Related papers (2023-09-29T07:17:43Z) - HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding [5.233797258148846]
HA-ViD is the first human assembly video dataset that features representative industrial assembly scenarios.
We provide 3222 multi-view, multi-modality videos (each video contains one assembly task), 1.5M frames, 96K temporal labels and 2M spatial labels.
We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking.
arXiv Detail & Related papers (2023-07-09T08:44:46Z) - KETOD: Knowledge-Enriched Task-Oriented Dialogue [77.59814785157877]
Existing studies in dialogue system research mostly treat task-oriented dialogue and chit-chat as separate domains.
We investigate how task-oriented dialogue and knowledge-grounded chit-chat can be effectively integrated into a single model.
arXiv Detail & Related papers (2022-05-11T16:01:03Z) - Few-Shot Bot: Prompt-Based Learning for Dialogue Systems [58.27337673451943]
Learning to converse using only a few examples is a great challenge in conversational AI.
The current best conversational models are either good chit-chatters (e.g., BlenderBot) or goal-oriented systems (e.g., MinTL).
We propose prompt-based few-shot learning which does not require gradient-based fine-tuning but instead uses a few examples as the only source of learning.
arXiv Detail & Related papers (2021-10-15T14:36:45Z) - The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)