Learning Skill-Attributes for Transferable Assessment in Video
- URL: http://arxiv.org/abs/2511.13993v1
- Date: Mon, 17 Nov 2025 23:53:06 GMT
- Title: Learning Skill-Attributes for Transferable Assessment in Video
- Authors: Kumar Ashutosh, Kristen Grauman
- Abstract summary: Skill assessment from video entails rating the quality of a person's physical performance and explaining what could be done better. Our CrossTrainer approach discovers skill-attributes, such as balance, control, and hand positioning. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques.
- Score: 56.813876909367856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Skill assessment from video entails rating the quality of a person's physical performance and explaining what could be done better. Today's models specialize for an individual sport, and suffer from the high cost and scarcity of expert-level supervision across the long tail of sports. Towards closing that gap, we explore transferable video representations for skill assessment. Our CrossTrainer approach discovers skill-attributes, such as balance, control, and hand positioning -- whose meaning transcends the boundaries of any given sport, then trains a multimodal language model to generate actionable feedback for a novel video, e.g., "lift hands more to generate more power" as well as its proficiency level, e.g., early expert. We validate the new model on multiple datasets for both cross-sport (transfer) and intra-sport (in-domain) settings, where it achieves gains up to 60% relative to the state of the art. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques, enriching today's multimodal large language models.
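The abstract describes a two-step recipe: discover sport-agnostic skill-attributes, then have a multimodal language model condition on them to produce actionable feedback and a proficiency level. The following is a minimal, hypothetical sketch of such an inference pass; `query_mllm`, the attribute list, and the prompt wording are illustrative stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of a CrossTrainer-style inference pass: skill-attributes
# discovered offline are injected into the prompt so that the feedback and the
# proficiency level are grounded in sport-agnostic concepts.
from dataclasses import dataclass

# Sport-agnostic skill-attributes of the kind the paper reports discovering.
SKILL_ATTRIBUTES = ["balance", "control", "hand positioning", "timing"]
# Proficiency vocabulary; "early expert" appears in the abstract, the rest are assumed.
PROFICIENCY_LEVELS = ["novice", "early expert", "late expert"]

@dataclass
class Assessment:
    feedback: str     # actionable advice, e.g. "lift hands more to generate more power"
    proficiency: str  # one of PROFICIENCY_LEVELS

def assess(video_path: str, query_mllm) -> Assessment:
    """Rate a performance video by reasoning over shared skill-attributes."""
    prompt = (
        "Assess the athlete in this video. For each attribute in "
        f"{SKILL_ATTRIBUTES}, note what is done well and what to improve. "
        "Then give one actionable tip and end with 'LEVEL: <level>' chosen from "
        f"{PROFICIENCY_LEVELS}."
    )
    reply = query_mllm(video=video_path, text=prompt)  # model-specific call, assumed
    body, _, level = reply.rpartition("LEVEL:")        # toy parse of the reply
    return Assessment(feedback=body.strip(), proficiency=level.strip())
```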
Related papers
- SkillSight: Efficient First-Person Skill Assessment with Gaze [51.16409727318035]
We introduce SkillSight for power-efficient skill assessment from first-person data. Our two-stage framework learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding.
arXiv Detail & Related papers (2025-11-24T19:05:28Z)
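The distillation step described above follows the standard teacher-student pattern: a teacher that sees gaze plus video is compressed into a gaze-only student. A minimal PyTorch sketch of that idea, with assumed feature dimensions and a plain KL distillation loss (not SkillSight's actual architecture or objective):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_LEVELS = 4  # number of skill levels; illustrative

# Teacher consumes concatenated video (512-d) and gaze (32-d) features;
# the student sees gaze features only. Dimensions are assumptions.
teacher = nn.Sequential(nn.Linear(512 + 32, 256), nn.ReLU(), nn.Linear(256, NUM_LEVELS))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, NUM_LEVELS))

def distill_step(video_feat, gaze_feat, optimizer, T=2.0):
    """One step of matching the gaze-only student to the gaze+video teacher."""
    with torch.no_grad():
        t_logits = teacher(torch.cat([video_feat, gaze_feat], dim=-1))
    s_logits = student(gaze_feat)
    # KL divergence between temperature-softened distributions (standard KD).
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```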
- DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning [25.001089287899998]
DeepSport is the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
arXiv Detail & Related papers (2025-11-17T02:57:15Z)
- SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports [21.410115837645318]
SportR is the first multi-sport, large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought annotations.
arXiv Detail & Related papers (2025-11-09T18:55:20Z)
- ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation [3.115853870709636]
We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning. It jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos.
arXiv Detail & Related papers (2025-09-30T14:00:41Z)
- ExpertAF: Expert Actionable Feedback from Video [81.46431188306397]
We introduce a novel method to generate actionable feedback from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates expert commentary describing what the person is doing well and what they could improve. We show how to leverage Ego-Exo4D's [29] videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task.
arXiv Detail & Related papers (2024-08-01T16:13:07Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
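The verbalize-then-reason recipe above reduces to three steps: caption sampled frames, concatenate the captions into a story, and run the downstream task on the story with a text-only model. A short sketch, where `caption_frame` and `ask_llm` are hypothetical placeholders for any off-the-shelf captioner and language model:

```python
def verbalize(frames, caption_frame) -> str:
    """Turn a video into a natural-language 'story', one caption per sampled frame."""
    captions = [f"[{i}] {caption_frame(frame)}" for i, frame in enumerate(frames)]
    return " ".join(captions)

def zero_shot_video_task(frames, question, caption_frame, ask_llm) -> str:
    """Answer a video-understanding query from the story alone, with no video training."""
    story = verbalize(frames, caption_frame)
    prompt = f"Story of the video:\n{story}\n\nTask: {question}\nAnswer:"
    return ask_llm(prompt)
```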
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
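A parameter-free temporal-saliency mechanism in the spirit of the Temporal Concept Spotting described above can be written as a softmaxed cosine similarity between a category's text embedding and per-frame embeddings, followed by saliency-weighted pooling. The sketch below uses random tensors as stand-ins for CLIP-style features, and the temperature is an assumed value:

```python
import torch
import torch.nn.functional as F

def temporal_saliency(frame_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """frame_emb: (T, D) per-frame features; text_emb: (D,) category text feature."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = frame_emb @ text_emb             # (T,) cosine similarity per frame
    return F.softmax(sims / 0.07, dim=0)    # no learned parameters involved

# Saliency-weighted pooling of frame features into one video-level feature.
frames, text = torch.randn(16, 512), torch.randn(512)
weights = temporal_saliency(frames, text)           # (16,)
video_feature = (weights[:, None] * frames).sum(0)  # (512,)
```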
- Sports Video Analysis on Large-Scale Data [10.24207108909385]
This paper investigates automated machine description of sports video.
We propose a novel large-scale NBA dataset for Sports Video Analysis (NSVA) with a focus on captioning.
arXiv Detail & Related papers (2022-08-09T16:59:24Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches to self-supervised learning, focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We learn not only the dynamic video information but also the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)
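The two-stream design summarized above (dynamic video features and static posture features, fused to regress an expert score) can be sketched in a few lines of PyTorch. The encoders and feature dimensions below are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TwoStreamScorer(nn.Module):
    """Fuse a dynamic stream and a static-posture stream, then regress a score."""
    def __init__(self, dyn_dim=512, static_dim=256):
        super().__init__()
        self.dyn_net = nn.Sequential(nn.Linear(dyn_dim, 128), nn.ReLU())
        self.static_net = nn.Sequential(nn.Linear(static_dim, 128), nn.ReLU())
        self.regressor = nn.Linear(256, 1)  # fused 128+128 features -> quality score

    def forward(self, dyn_feat, static_feat):
        fused = torch.cat([self.dyn_net(dyn_feat), self.static_net(static_feat)], dim=-1)
        return self.regressor(fused).squeeze(-1)

model = TwoStreamScorer()
scores = model(torch.randn(4, 512), torch.randn(4, 256))  # batch of 4 videos
loss = nn.MSELoss()(scores, torch.rand(4))                # supervised by expert scores
```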
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.