Vid2Coach: Transforming How-To Videos into Task Assistants
- URL: http://arxiv.org/abs/2506.00717v2
- Date: Fri, 25 Jul 2025 17:41:26 GMT
- Title: Vid2Coach: Transforming How-To Videos into Task Assistants
- Authors: Mina Huh, Zihui Xue, Ujjaini Das, Kumar Ashutosh, Kristen Grauman, Amy Pavel
- Abstract summary: We propose Vid2Coach, a system that transforms how-to videos into wearable camera-based assistants. Vid2Coach generates accessible instructions by augmenting narrated instructions with demonstration details and completion criteria for each step. It then uses retrieval-augmented generation to extract relevant non-visual workarounds from BLV-specific resources.
- Score: 51.729869497134885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: People use videos to learn new recipes, exercises, and crafts. Such videos remain difficult for blind and low vision (BLV) people to follow as they rely on visual comparison. Our observations of visual rehabilitation therapists (VRTs) guiding BLV people to follow how-to videos revealed that VRTs provide both proactive and responsive support including detailed descriptions, non-visual workarounds, and progress feedback. We propose Vid2Coach, a system that transforms how-to videos into wearable camera-based assistants that provide accessible instructions and mixed-initiative feedback. From the video, Vid2Coach generates accessible instructions by augmenting narrated instructions with demonstration details and completion criteria for each step. It then uses retrieval-augmented generation to extract relevant non-visual workarounds from BLV-specific resources. Vid2Coach then monitors user progress with a camera embedded in commercial smart glasses to provide context-aware instructions, proactive feedback, and answers to user questions. BLV participants (N=8) using Vid2Coach completed cooking tasks with 58.5% fewer errors than when using their typical workflow and wanted to use Vid2Coach in their daily lives. Vid2Coach demonstrates an opportunity for AI visual assistance that strengthens rather than replaces non-visual expertise.
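To make the retrieval-augmented generation step concrete, the snippet below is a minimal, self-contained sketch of how non-visual workarounds might be retrieved for a narrated step. The toy BLV-resource corpus, the bag-of-words scoring, and the function names (embed, retrieve_workarounds) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the retrieval step described in the abstract: given an
# instruction narrated in the how-to video, retrieve relevant non-visual
# workarounds from a small corpus of BLV-specific resources. A real system
# would use a learned text encoder and a much larger indexed corpus.

from collections import Counter
import math

BLV_RESOURCES = [
    "Use a talking kitchen thermometer to check doneness instead of color cues.",
    "Level dry ingredients by sweeping a finger across the top of the measuring cup.",
    "Listen for the change in sizzling sound to know when onions are softened.",
]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; stands in for a real text encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_workarounds(step_instruction: str, top_k: int = 2) -> list[str]:
    """Rank BLV-specific tips by similarity to the narrated step."""
    query = embed(step_instruction)
    ranked = sorted(BLV_RESOURCES, key=lambda r: cosine(query, embed(r)), reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    step = "Cook the onions until they are golden and softened."
    for tip in retrieve_workarounds(step):
        print("-", tip)
```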
Related papers
- SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding [23.96372422130216]
Video-based Large Language Models (Video-LLMs) have witnessed substantial advancements in recent years. However, they struggle with fine-grained understanding, particularly in aspects such as visual dynamics and video detail inquiries. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities.
arXiv Detail & Related papers (2025-04-10T13:40:34Z) - PVChat: Personalized Video Chat with One-Shot Learning [15.328085576102106]
PVChat is a one-shot learning framework that enables subject-aware question answering from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage.
arXiv Detail & Related papers (2025-03-21T11:50:06Z) - When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning [80.09819072780193]
We propose a self-supervised framework that leverages Temporal Correspondence for video representation learning (T-CoRe). Experiments with T-CoRe consistently show superior performance across several downstream tasks, demonstrating its effectiveness for video representation learning.
arXiv Detail & Related papers (2025-03-19T10:50:03Z) - Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model [133.01510927611452]
We present Step-Video-T2V, a text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep-compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality.
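As a rough back-of-the-envelope illustration of the compression ratios quoted above (16x16 spatial and 8x temporal), the snippet below computes the latent grid for a 204-frame clip; the 720p frame resolution is an assumed example, not a figure from this summary.

```python
# Illustrative only: the 720p resolution is an assumption; the 204-frame length
# and the 16x16 spatial / 8x temporal compression ratios come from the summary above.
frames, height, width = 204, 720, 1280
spatial, temporal = 16, 8

latent_t = -(-frames // temporal)   # ceil(204 / 8) = 26 latent time steps
latent_h = height // spatial        # 720 / 16 = 45
latent_w = width // spatial         # 1280 / 16 = 80
print("latent grid (t, h, w):", (latent_t, latent_h, latent_w))
print("token reduction factor:", spatial * spatial * temporal)  # 2048x
```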
arXiv Detail & Related papers (2025-02-14T15:58:10Z) - ExpertAF: Expert Actionable Feedback from Video [81.46431188306397]
We introduce a novel method to generate actionable feedback from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates expert commentary describing what the person is doing well and what they could improve. We show how to leverage Ego-Exo4D's [29] videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task.
arXiv Detail & Related papers (2024-08-01T16:13:07Z) - Valley: Video Assistant with Large Language model Enhanced abilitY [46.90402681897982]
We introduce Valley, a multi-modal foundation model that is designed to enable enhanced video comprehension and instruction-following capabilities. Our experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios.
arXiv Detail & Related papers (2023-06-12T16:11:10Z) - InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z) - NarrationBot and InfoBot: A Hybrid System for Automated Video Description [9.59921187620835]
We develop a hybrid system of two tools to automatically generate descriptions for videos.
We show that our system significantly improved user comprehension and enjoyment of selected videos when both tools were used in tandem.
Our results demonstrate user enthusiasm about the developed system and its promise for providing customized access to videos.
arXiv Detail & Related papers (2021-11-07T04:13:30Z) - Broaden Your Views for Self-Supervised Video Learning [97.52216510672251]
We introduce BraVe, a self-supervised learning framework for video.
In BraVe, one of the views has access to a narrow temporal window of the video while the other view has a broad access to the video content.
We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks.
arXiv Detail & Related papers (2021-03-30T17:58:46Z) - Translating Video Recordings of Mobile App Usages into Replayable Scenarios [24.992877070869177]
V2S is a lightweight, automated approach for translating video recordings of Android app usages into replayable scenarios.
We performed an extensive evaluation of V2S involving 175 videos depicting 3,534 GUI-based actions collected from users exercising features and reproducing bugs from over 80 popular Android apps.
arXiv Detail & Related papers (2020-05-18T20:11:36Z)