ExpertAF: Expert Actionable Feedback from Video
- URL: http://arxiv.org/abs/2408.00672v1
- Date: Thu, 1 Aug 2024 16:13:07 GMT
- Title: ExpertAF: Expert Actionable Feedback from Video
- Authors: Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, Kristen Grauman
- Abstract summary: We introduce a novel method to generate actionable feedback from video of a person doing a physical activity.
Our method takes a video demonstration and its accompanying 3D body pose and generates expert commentary.
Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching.
- Score: 81.46431188306397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feedback is essential for learning a new skill or improving one's current skill-level. However, current methods for skill-assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching -- expert commentary, expert video retrieval, and the first-of-its-kind expert pose generation -- outperforming strong vision-language models on both established metrics and human preference studies.
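The recipe described in the abstract, a video demonstration plus its 3D body pose mapped into a language model that decodes coaching commentary, can be illustrated with a minimal sketch. Everything below (module sizes, feature dimensions, and the toy GRU standing in for the language model) is a hypothetical assumption for illustration, not the authors' implementation:

```python
# Minimal sketch: project video and 3D-pose features into a language model's
# token space, then decode free-form commentary. All names and sizes are
# illustrative assumptions; this is not the ExpertAF architecture.
import torch
import torch.nn as nn

class FeedbackModel(nn.Module):
    def __init__(self, vid_dim=768, pose_dim=51, d_model=512, vocab=32000):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, d_model)    # clip features -> LM space
        self.pose_proj = nn.Linear(pose_dim, d_model)  # per-frame 3D joints -> LM space
        self.embed = nn.Embedding(vocab, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for an LLM
        self.head = nn.Linear(d_model, vocab)

    def forward(self, vid_feats, pose_feats, commentary_ids):
        # Prefix the text tokens with projected visual and pose "tokens".
        prefix = torch.cat([self.vid_proj(vid_feats), self.pose_proj(pose_feats)], dim=1)
        tokens = torch.cat([prefix, self.embed(commentary_ids)], dim=1)
        hidden, _ = self.decoder(tokens)
        # Predict commentary logits only over the text positions.
        return self.head(hidden[:, prefix.size(1):])

model = FeedbackModel()
vid = torch.randn(1, 8, 768)              # 8 pooled video clip features
pose = torch.randn(1, 16, 51)             # 16 frames x 17 joints x 3 coords, flattened
text = torch.randint(0, 32000, (1, 12))   # tokenized commentary (teacher forcing)
logits = model(vid, pose, text)
print(logits.shape)  # torch.Size([1, 12, 32000])
```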
Related papers
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
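As a rough illustration of the instruction-guided diffusion idea in the InstructVid2Vid entry above, here is a toy text-conditioned DDPM sampling loop over per-frame latents; the tiny denoiser, linear beta schedule, and all dimensions are illustrative assumptions, not the paper's model:

```python
# Toy instruction-conditioned diffusion sampler over per-frame video latents.
import torch
import torch.nn as nn

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    def __init__(self, latent_dim=64, cond_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 128), nn.SiLU(),
            nn.Linear(128, latent_dim),
        )
    def forward(self, x, t, cond):
        # Predict noise from the latent, the timestep, and the instruction embedding.
        t_feat = t.float().view(-1, 1) / T
        return self.net(torch.cat([x, cond, t_feat], dim=-1))

@torch.no_grad()
def edit(denoiser, instruction_emb, frames=8, latent_dim=64):
    x = torch.randn(frames, latent_dim)       # start from noise, one latent per frame
    cond = instruction_emb.expand(frames, -1) # same instruction conditions every frame
    for t in reversed(range(T)):              # standard DDPM ancestral sampling
        eps = denoiser(x, torch.full((frames,), t), cond)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # edited per-frame latents, to be decoded back to pixels

edited = edit(Denoiser(), instruction_emb=torch.randn(1, 32))
print(edited.shape)  # torch.Size([8, 64])
```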
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
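A minimal sketch of the verbalize-then-reason pipeline from the entry above, assuming hypothetical `caption_frame` and `answer_from_text` stand-ins for an off-the-shelf captioner and language model:

```python
# Verbalize a long video into a text "story", then run the downstream task
# on the text alone; the LLM never sees pixels.
from typing import Callable, List, Sequence

def verbalize_video(frames: Sequence, caption_frame: Callable[[object], str]) -> str:
    # One caption per sampled frame, stitched into a chronological story.
    captions = [caption_frame(f) for f in frames]
    return " ".join(f"Scene {i + 1}: {c}" for i, c in enumerate(captions))

def zero_shot_video_qa(frames: Sequence, question: str,
                       caption_frame: Callable[[object], str],
                       answer_from_text: Callable[[str], str]) -> str:
    story = verbalize_video(frames, caption_frame)
    prompt = f"{story}\nQuestion: {question}\nAnswer:"
    return answer_from_text(prompt)

# Toy usage with dummy components standing in for real models:
frames = ["frame0", "frame1"]
print(zero_shot_video_qa(frames, "What happens?",
                         caption_frame=lambda f: f"a person appears in {f}",
                         answer_from_text=lambda p: "(LLM answer for: " + p[:40] + "...)"))
```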
- REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- CLUE: Contextualised Unified Explainable Learning of User Engagement in Video Lectures [6.25256391074865]
We propose a new unified model, CLUE, which learns from the features extracted from public online teaching videos.
Our model exploits various multi-modal features to model the complexity of language, context information, and the textual emotion of the delivered content.
arXiv Detail & Related papers (2022-01-14T19:51:06Z)
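A generic late-fusion sketch in the spirit of the CLUE entry above; the real CLUE architecture is not reproduced here, and all feature dimensions are assumptions:

```python
# Late fusion of language, context, and emotion features into one engagement score.
import torch
import torch.nn as nn

class EngagementRegressor(nn.Module):
    def __init__(self, lang_dim=768, context_dim=128, emotion_dim=32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(lang_dim + context_dim + emotion_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),  # engagement score in [0, 1]
        )
    def forward(self, lang, context, emotion):
        return self.fuse(torch.cat([lang, context, emotion], dim=-1))

model = EngagementRegressor()
score = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 32))
print(score.shape)  # torch.Size([4, 1])
```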
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
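The adaptation idea in the entry above, a small trainable temporal Transformer over frozen per-frame features, can be sketched as follows (depth, width, and pooling are illustrative assumptions):

```python
# Keep the pre-trained image-text model frozen; learn only a lightweight
# temporal Transformer over its per-frame features.
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    def __init__(self, feat_dim=512, n_frames=16, n_heads=8, n_layers=2):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, n_frames, feat_dim))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, n_frames, feat_dim) from a frozen image encoder.
        x = self.encoder(frame_feats + self.pos)
        return x.mean(dim=1)  # pooled video-level feature for text matching

adapter = TemporalAdapter()
video_feat = adapter(torch.randn(2, 16, 512))
print(video_feat.shape)  # torch.Size([2, 512])
```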
- Learning Object Manipulation Skills via Approximate State Estimation from Real Videos [47.958512470724926]
Humans are adept at learning new tasks by watching a few instructional videos.
Robots, by contrast, learn new actions either through effortful trial and error or from expert demonstrations that are challenging to obtain.
In this paper, we explore a method that facilitates learning object manipulation skills directly from videos.
arXiv Detail & Related papers (2020-11-13T08:53:47Z)
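One common recipe consistent with the entry above, though not necessarily the paper's exact method: estimate a coarse state trajectory from the video, then reward a policy for tracking it. The `estimate_state` callable is a hypothetical stand-in for a pose/object estimator:

```python
# Turn a video into an approximate state trajectory, then use it as a
# dense tracking reward for policy learning.
from typing import Callable, List, Sequence
import math

def trajectory_from_video(frames: Sequence, estimate_state: Callable) -> List:
    return [estimate_state(f) for f in frames]  # e.g. coarse hand/object positions

def tracking_reward(agent_state, target_state) -> float:
    # Closer to the demonstrated state -> higher reward.
    dist = math.dist(agent_state, target_state)
    return math.exp(-dist)

# Toy usage with a dummy 2-D "estimator":
frames = [0, 1, 2]
target = trajectory_from_video(frames, estimate_state=lambda f: (float(f), 0.0))
print([round(tracking_reward((t, 0.1), target[i]), 3)
       for i, t in enumerate([0.0, 1.2, 1.9])])
```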
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.