Video Pose Distillation for Few-Shot, Fine-Grained Sports Action
Recognition
- URL: http://arxiv.org/abs/2109.01305v1
- Date: Fri, 3 Sep 2021 04:36:12 GMT
- Title: Video Pose Distillation for Few-Shot, Fine-Grained Sports Action
Recognition
- Authors: James Hong, Matthew Fisher, Michaël Gharbi, Kayvon Fatahalian
- Abstract summary: Video Pose Distillation (VPD) is a weakly-supervised technique to learn features for new video domains.
VPD features improve performance on few-shot, fine-grained action recognition, retrieval, and detection tasks in four real-world sports video datasets.
- Score: 17.84533144792773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human pose is a useful feature for fine-grained sports action understanding.
However, pose estimators are often unreliable when run on sports video due to
domain shift and factors such as motion blur and occlusions. This leads to poor
accuracy when downstream tasks, such as action recognition, depend on pose.
End-to-end learning circumvents pose, but requires more labels to generalize.
We introduce Video Pose Distillation (VPD), a weakly-supervised technique to
learn features for new video domains, such as individual sports that challenge
pose estimation. Under VPD, a student network learns to extract robust pose
features from RGB frames in the sports video, such that, whenever pose is
considered reliable, the features match the output of a pretrained teacher pose
detector. Our strategy retains the best of both pose and end-to-end worlds,
exploiting the rich visual patterns in raw video frames, while learning
features that agree with the athletes' pose and motion in the target video
domain to avoid over-fitting to patterns unrelated to athletes' motion.
VPD features improve performance on few-shot, fine-grained action
recognition, retrieval, and detection tasks in four real-world sports video
datasets, without requiring additional ground-truth pose annotations.
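The training signal described in the abstract amounts to a confidence-masked regression of student features onto teacher pose features: the student sees only RGB frames, and the loss is applied only where the pretrained pose detector is considered reliable. The snippet below is a minimal PyTorch-style sketch under that reading; the module name, tensor shapes, and the 0.7 reliability threshold are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class VPDStyleDistillationLoss(nn.Module):
    """Regress student features onto teacher pose features, counting only
    frames where the teacher's pose estimate is considered reliable."""

    def __init__(self, conf_threshold: float = 0.7):
        super().__init__()
        self.conf_threshold = conf_threshold  # assumed value, not from the paper

    def forward(self, student_feats, teacher_feats, teacher_conf):
        # student_feats, teacher_feats: (batch, feature_dim)
        # teacher_conf: (batch,) per-frame confidence from the pose detector
        per_frame = ((student_feats - teacher_feats) ** 2).mean(dim=-1)
        mask = (teacher_conf > self.conf_threshold).float()
        # Average only over frames with a reliable teacher pose.
        return (per_frame * mask).sum() / mask.sum().clamp(min=1.0)


# Usage sketch: `student` is any network over RGB frames; the teacher is a
# frozen, pretrained pose detector whose features and confidences are
# precomputed per frame.
# loss = VPDStyleDistillationLoss()(student(frames), teacher_feats, teacher_conf)
```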
Related papers
- Seeing the Pose in the Pixels: Learning Pose-Aware Representations in
Vision Transformers [1.8047694351309207]
We introduce two strategies for learning pose-aware representations in Vision Transformers (ViTs).
The first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT block that performs localized attention on pose regions within videos.
The second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary pose prediction task optimized jointly with the primary ViT task.
arXiv Detail & Related papers (2023-06-15T17:58:39Z)
- Towards Active Learning for Action Spotting in Association Football
Videos [59.84375958757395]
Analyzing football videos is challenging and requires identifying subtle and diverse spatio-temporal patterns.
Current algorithms face significant challenges when learning from limited annotated data.
We propose an active learning framework that selects the most informative video samples to be annotated next.
arXiv Detail & Related papers (2023-04-09T11:50:41Z)
- A Survey on Video Action Recognition in Sports: Datasets, Methods and
Applications [60.3327085463545]
We present a survey on video action recognition for sports analytics.
We introduce more than ten types of sports, including team sports such as football, basketball, volleyball, and hockey, and individual sports such as figure skating, gymnastics, table tennis, diving, and badminton.
We develop a toolbox using PaddlePaddle, which supports football, basketball, table tennis and figure skating action recognition.
arXiv Detail & Related papers (2022-06-02T13:19:36Z)
- Enhancing Unsupervised Video Representation Learning by Decoupling the
Scene and the Motion [86.56202610716504]
Action categories are highly correlated with the scene where the action happens, which makes models tend to degrade to solutions that encode only scene information.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action
Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the dynamic information of the video but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)
- IntegralAction: Pose-driven Feature Integration for Robust Human Action
Recognition in Videos [94.06960017351574]
We learn pose-driven feature integration that dynamically combines appearance and pose streams by observing pose features on the fly.
We show that the proposed IntegralAction achieves highly robust performance across in-context and out-of-context action video datasets.
arXiv Detail & Related papers (2020-07-13T11:24:48Z)
- Decoupling Video and Human Motion: Towards Practical Event Detection in
Athlete Recordings [33.770877823910176]
We propose to use 2D human pose sequences as an intermediate representation that decouples human motion from the raw video information.
We describe two approaches to event detection on pose sequences and evaluate them in complementary domains: swimming and athletics.
Our approach is not limited to these domains and shows the flexibility of pose-based motion event detection.
arXiv Detail & Related papers (2020-04-21T07:06:12Z)
- Human Motion Transfer from Poses in the Wild [61.6016458288803]
We tackle the problem of human motion transfer, where we synthesize novel motion video for a target person that imitates the movement from a reference video.
It is a video-to-video translation task in which the estimated poses are used to bridge two domains.
We introduce a novel pose-to-video translation framework for generating high-quality videos that are temporally coherent even for in-the-wild pose sequences unseen during training.
arXiv Detail & Related papers (2020-04-07T05:59:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.