From Videos to Conversations: Egocentric Instructions for Task Assistance
- URL: http://arxiv.org/abs/2602.01038v1
- Date: Sun, 01 Feb 2026 05:53:41 GMT
- Title: From Videos to Conversations: Egocentric Instructions for Task Assistance
- Authors: Lavisha Aggarwal, Vikas Bahirwani, Andrea Colaco
- Abstract summary: We present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches.
- Score: 2.848400947017194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question-answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.
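The abstract describes the video-to-conversation pipeline only at a high level. The sketch below is a minimal illustration of the general idea, assuming per-step narration is already available from the source video (e.g. via ASR or existing annotations); it is not the authors' released pipeline, and the DialogueTurn record, prompt wording, and generate() stub are illustrative placeholders.

```python
# Minimal sketch of the general idea (not the authors' released pipeline):
# convert per-step narration from a single-person instructional video into
# expert-novice dialogue turns with an LLM. The record layout, prompt wording,
# and generate() stub below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DialogueTurn:
    role: str        # "novice" or "expert"
    text: str
    clip_span: tuple  # (start_sec, end_sec) of the grounding video segment

PROMPT_TEMPLATE = (
    "You are converting one step of an instructional video into a two-person exchange.\n"
    "Step narration: {narration}\n"
    "Write (1) a short question a novice performing the task might ask at this point\n"
    "and (2) the expert's answer, grounded only in the narration.\n"
    "Format: Q: <question>\\nA: <answer>"
)

def generate(prompt: str) -> str:
    """Placeholder LLM call; swap in a real client (e.g. a local open-weights model)."""
    narration = prompt.split("Step narration: ")[1].split("\n")[0]
    return "Q: What should I do next?\nA: " + narration

def video_to_dialogue(steps):
    """steps: list of (narration, start_sec, end_sec) tuples from the source video."""
    turns = []
    for narration, start, end in steps:
        reply = generate(PROMPT_TEMPLATE.format(narration=narration))
        question, _, answer = reply.partition("\nA: ")
        turns.append(DialogueTurn("novice", question.removeprefix("Q: ").strip(), (start, end)))
        turns.append(DialogueTurn("expert", answer.strip(), (start, end)))
    return turns

if __name__ == "__main__":
    demo = [("Disconnect the washer from power before removing the rear panel.", 12.0, 24.5)]
    for turn in video_to_dialogue(demo):
        print(f"[{turn.role}] {turn.text}")
```

A real pipeline would additionally need to segment the video into steps and vary the novice questions and follow-ups across turns; per the abstract, the authors' LLM-based pipeline performs this transformation fully automatically.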
Related papers
- IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval [36.33423199468626]
The Interactive Video Corpus Retrieval (IVCR) task enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. IVCR-200K is a high-quality, bilingual, multi-turn, conversational, and abstract semantic dataset that supports video retrieval and even moment retrieval. We propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions.
arXiv Detail & Related papers (2025-12-01T06:12:59Z) - A Multimodal Conversational Agent for Tabular Data Analysis [0.2211620227346065]
Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. We present Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. The system lets users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations.
arXiv Detail & Related papers (2025-11-23T11:21:04Z) - Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark [4.583536383592244]
We propose a simple yet effective approach that transforms single-person instructional videos into two-person task-guidance dialogues. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. We build HowToDIV, a large-scale dataset containing 507 conversations, 6,636 question-answer pairs, and 24 hours of video clips across diverse tasks in cooking, mechanics, and planting.
arXiv Detail & Related papers (2025-08-15T03:57:20Z) - Proactive Assistant Dialogue Generation from Streaming Egocentric Videos [48.30863954384779]
This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses.
arXiv Detail & Related papers (2025-06-06T09:23:29Z) - InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models [11.913271486031201]
We develop InsTALL, a context-aware instructional task assistant built with multi-modal large language models. InsTALL responds in real time to user queries related to the task at hand. We show InsTALL achieves state-of-the-art performance across the proposed sub-tasks for multimodal activity understanding.
arXiv Detail & Related papers (2025-01-21T15:55:06Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - On the Multi-turn Instruction Following for Conversational Web Agents [83.51251174629084]
We introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment.
We propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques.
arXiv Detail & Related papers (2024-02-23T02:18:12Z) - Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for
Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z) - The IKEA ASM Dataset: Understanding People Assembling Furniture through
Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.