Skating-Mixer: Multimodal MLP for Scoring Figure Skating
- URL: http://arxiv.org/abs/2203.03990v1
- Date: Tue, 8 Mar 2022 10:36:55 GMT
- Title: Skating-Mixer: Multimodal MLP for Scoring Figure Skating
- Authors: Jingfei Xia, Mingchen Zhuge, Tiantian Geng, Shun Fan, Yuantai Wei,
Zhenyu He and Feng Zheng
- Abstract summary: We introduce a multimodal MLP architecture named Skating-Mixer.
It effectively learns long-term representations through our designed memory recurrent unit (MRU).
Experiments show the proposed method outperforms state-of-the-art methods on all major metrics on the public Fis-V dataset and our FS1000 dataset.
- Score: 31.346611498891964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Figure skating scoring is a challenging task because it requires judging
players' technical moves as well as their coordination with the background music.
Prior learning-based work cannot solve it well for two reasons: 1) each move in
figure skating changes quickly, so simply applying traditional frame sampling
loses a lot of valuable information, especially in a video lasting 3-5 minutes,
which makes extremely long-range representation learning necessary; 2) prior
methods rarely considered the critical audio-visual relationship in their
models. We therefore introduce a multimodal MLP architecture, named
Skating-Mixer. It extends the MLP-Mixer framework to a multimodal setting and
effectively learns long-term representations through our designed memory
recurrent unit (MRU). Aside from the model, we also collected a high-quality
audio-visual dataset, FS1000, which contains over 1000 videos covering 8 types
of programs with 7 different rating metrics, surpassing other datasets in both
quantity and diversity. Experiments show the proposed method outperforms
state-of-the-art methods on all major metrics on both the public Fis-V dataset
and our FS1000 dataset. In addition, we include an analysis applying our method
to recent competitions from the Beijing 2022 Winter Olympic Games,
demonstrating that our method is highly robust.
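
The abstract names two ingredients: multimodal MLP-Mixer blocks for per-clip features and a memory recurrent unit (MRU) that carries long-term context across clips. As a rough illustration only, the sketch below shows one plausible GRU-style gating form for such a unit; the module names, the gating equations, and the linear-fusion stand-in for the Mixer blocks are all assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class MemoryRecurrentUnit(nn.Module):
    """GRU-style gated memory over per-clip features (a hypothetical sketch,
    not the paper's exact MRU)."""
    def __init__(self, dim: int):
        super().__init__()
        self.update_gate = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, memory: torch.Tensor, clip_feat: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([memory, clip_feat], dim=-1)
        z = torch.sigmoid(self.update_gate(joint))   # how much of the memory to rewrite
        cand = torch.tanh(self.candidate(joint))     # candidate long-term state
        return (1 - z) * memory + z * cand

class SkatingScorer(nn.Module):
    """Fuses per-clip video and audio features, rolls them through the MRU,
    and regresses a score from the final memory state."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # stand-in for the multimodal Mixer blocks
        self.mru = MemoryRecurrentUnit(dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, video_clips: torch.Tensor, audio_clips: torch.Tensor) -> torch.Tensor:
        # video_clips, audio_clips: (batch, num_clips, dim)
        batch, num_clips, dim = video_clips.shape
        memory = torch.zeros(batch, dim, device=video_clips.device)
        for t in range(num_clips):  # iterate clips to accumulate long-range context
            clip = self.fuse(torch.cat([video_clips[:, t], audio_clips[:, t]], dim=-1))
            memory = self.mru(memory, clip)
        return self.head(memory).squeeze(-1)  # one predicted score per video
```

For example, `SkatingScorer()(torch.randn(2, 40, 256), torch.randn(2, 40, 256))` returns a `(2,)` tensor of scores, one per video of 40 sampled clips.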
Related papers
- YourSkatingCoach: A Figure Skating Video Benchmark for Fine-Grained Element Analysis [10.444961818248624]
The dataset contains 454 videos of jump elements, the detected skater skeletons in each video, and gold labels for the start and end frames of each jump, together forming a video benchmark for figure skating.
We propose air time detection, a novel motion analysis task whose goal is to accurately detect the duration of a jump's air time.
To verify the generalizability of the fine-grained labels, we apply the same process to other sports as cross-sports tasks, but for the coarse-grained task of action classification.
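
Given the gold start- and end-frame labels the benchmark provides, the air-time measurement itself reduces to frame arithmetic. A minimal sketch, assuming inclusive frame indices and a known frame rate:

```python
def air_time_seconds(start_frame: int, end_frame: int, fps: float = 30.0) -> float:
    """Duration a skater spends airborne, given labeled take-off and landing
    frames and the video frame rate (frame indices assumed inclusive)."""
    if end_frame < start_frame:
        raise ValueError("end_frame must not precede start_frame")
    return (end_frame - start_frame) / fps

# e.g. a jump labeled from frame 1200 to frame 1218 in a 30 fps video:
# air_time_seconds(1200, 1218) -> 0.6 seconds
```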
arXiv Detail & Related papers (2024-10-27T12:52:28Z)
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework that produces annotations across modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained training of large multimodal-language models.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
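
As a hedged illustration of what cascaded segment-then-region selection could look like, the sketch below scores temporal segments against a question embedding, keeps the top-k, and then scores regions inside the kept segments; the dot-product scoring and the k values are assumptions, not MIST's actual modules.

```python
import torch

def cascaded_select(question: torch.Tensor, segments: torch.Tensor,
                    regions: torch.Tensor, k_seg: int = 2, k_reg: int = 4) -> torch.Tensor:
    """Hypothetical sketch of cascaded selection (not MIST's actual modules).
      question: (dim,)                  question embedding
      segments: (num_seg, dim)          pooled per-segment features
      regions:  (num_seg, num_reg, dim) per-segment region features
    Returns (k_reg, dim) region features for downstream attention."""
    seg_scores = segments @ question                       # relevance of each segment
    top_seg = seg_scores.topk(k_seg).indices               # coarse temporal selection
    picked = regions[top_seg]                              # (k_seg, num_reg, dim)
    reg_scores = (picked @ question).reshape(-1)           # relevance of each region
    flat = picked.reshape(-1, picked.size(-1))
    return flat[reg_scores.topk(k_reg).indices]            # fine spatial selection
```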
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- The ReturnZero System for VoxCeleb Speaker Recognition Challenge 2022 [0.0]
We describe the top-scoring submissions from team RTZR for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
The top-performing system is a fusion of 7 models spanning 3 different model architectures.
The final submission achieves 0.165 DCF and 2.912% EER on the VoxSRC-22 test set.
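
The summary does not say how the 7 models were fused; a common baseline for score-level fusion in speaker verification is per-system z-normalization followed by a weighted average. A generic sketch under that assumption (not team RTZR's actual recipe):

```python
import numpy as np

def fuse_scores(system_scores: list[np.ndarray], weights=None) -> np.ndarray:
    """Score-level fusion across verification systems (a generic sketch,
    not team RTZR's actual recipe). Each array holds one score per trial.
    Per-system z-normalization puts differently scaled scores on a common
    footing before the weighted average."""
    normed = [(s - s.mean()) / (s.std() + 1e-8) for s in system_scores]
    w = np.ones(len(normed)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * si for wi, si in zip(w, normed))
```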
arXiv Detail & Related papers (2022-09-21T06:54:24Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves video classification accuracy with negligible computational overhead.
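
One plausible reading of such a mechanism is a shared, video-level memory that all sampled clips write to and read from within a training iteration. The sketch below uses mean pooling as the write and cross-attention as the read; both choices are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CollaborativeMemory(nn.Module):
    """Sketch of a collaborative memory across sampled clips. Mean pooling
    as the write and cross-attention as the read are assumptions, not the
    paper's exact mechanism."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, dim), one feature vector per sampled clip
        memory = clip_feats.mean(dim=1, keepdim=True)        # write: pool all clips
        enriched, _ = self.read(clip_feats, memory, memory)  # read: each clip attends to memory
        return clip_feats + enriched                         # clips gain video-level context
```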
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Unsupervised Temporal Feature Aggregation for Event Detection in Unstructured Sports Videos [10.230408415438966]
We study the case of event detection in sports videos for unstructured environments with arbitrary camera angles.
We identify and solve two major problems: unsupervised identification of players in an unstructured setting and generalization of the trained models to pose variations due to arbitrary shooting angles.
arXiv Detail & Related papers (2020-02-19T10:24:22Z)
- FSD-10: A Dataset for Competitive Sports Content Analysis [29.62110021022271]
The Figure Skating Dataset (FSD-10) is designed to offer a large collection of fine-grained actions.
Each clip runs at 30 frames per second with a resolution of 1080 $\times$ 720.
We evaluate state-of-the-art action recognition methods on FSD-10.
arXiv Detail & Related papers (2020-02-09T08:04:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.