Hybrid Dynamic-static Context-aware Attention Network for Action
Assessment in Long Videos
- URL: http://arxiv.org/abs/2008.05977v1
- Date: Thu, 13 Aug 2020 15:51:42 GMT
- Title: Hybrid Dynamic-static Context-aware Attention Network for Action
Assessment in Long Videos
- Authors: Ling-An Zeng, Fa-Ting Hong, Wei-Shi Zheng, Qi-Zhi Yu, Wei Zeng,
Yao-Wei Wang, and Jian-Huang Lai
- Abstract summary: We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
- Score: 96.45804577283563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of action quality assessment is to score sports videos.
However, most existing works focus only on video dynamic information (i.e.,
motion information) but ignore the specific postures that an athlete is
performing in a video, which is important for action assessment in long videos.
In this work, we present a novel hybrid dynAmic-static Context-aware attenTION
NETwork (ACTION-NET) for action assessment in long videos. To learn more
discriminative representations for videos, we not only learn the video dynamic
information but also focus on the static postures of the detected athletes in
specific frames, which represent the action quality at certain moments, with the
help of the proposed hybrid dynamic-static architecture. Moreover, we
leverage a context-aware attention module consisting of a temporal
instance-wise graph convolutional network unit and an attention unit for both
streams to extract more robust stream features, where the former is for
exploring the relations between instances and the latter for assigning a proper
weight to each instance. Finally, we combine the features of the two streams to
regress the final video score, supervised by ground-truth scores given by
experts. Additionally, we have collected and annotated the new Rhythmic
Gymnastics dataset, which contains videos of four different types of gymnastics
routines, for evaluation of action quality assessment in long videos. Extensive
experimental results validate the efficacy of our proposed method, which
outperforms related approaches. The codes and dataset are available at
https://github.com/lingan1996/ACTION-NET.
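The two-stream design described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation (see the linked repository for that); the module names, feature dimensions, and the simplified attention-style stand-in for the temporal instance-wise GCN unit are assumptions made purely for illustration.

```python
# Minimal sketch of a two-stream, context-aware attention pipeline in the
# spirit of ACTION-NET. NOT the authors' code; names, dimensions, and the
# attention-style stand-in for the temporal instance-wise GCN unit are
# illustrative assumptions.
import torch
import torch.nn as nn


class ContextAwareAttention(nn.Module):
    """Relates instances of one stream to each other, then pools them
    with learned per-instance weights."""

    def __init__(self, dim: int):
        super().__init__()
        # Stand-in for the temporal instance-wise GCN unit: a learned
        # affinity between instances followed by feature propagation.
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        # Attention unit: assigns a weight to each context-enhanced instance.
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.Tanh(), nn.Linear(dim // 2, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_instances, dim) per-instance features of one stream
        sim = self.query(x) @ self.key(x).transpose(1, 2) / x.size(-1) ** 0.5
        context = torch.softmax(sim, dim=-1) @ self.value(x)  # relations between instances
        weights = torch.softmax(self.score(context), dim=1)   # per-instance weights
        return (weights * context).sum(dim=1)                 # pooled stream feature


class TwoStreamScorer(nn.Module):
    """Fuses a dynamic (motion) stream and a static (posture) stream and
    regresses a single quality score."""

    def __init__(self, dyn_dim: int = 1024, sta_dim: int = 2048, hid: int = 512):
        super().__init__()
        self.dyn_proj = nn.Linear(dyn_dim, hid)
        self.sta_proj = nn.Linear(sta_dim, hid)
        self.dyn_attn = ContextAwareAttention(hid)
        self.sta_attn = ContextAwareAttention(hid)
        self.regressor = nn.Sequential(
            nn.Linear(2 * hid, hid), nn.ReLU(), nn.Linear(hid, 1))

    def forward(self, dyn_feats: torch.Tensor, sta_feats: torch.Tensor) -> torch.Tensor:
        # dyn_feats: (B, num_clips,  dyn_dim) clip-level motion features
        # sta_feats: (B, num_frames, sta_dim) features of detected athletes in sampled frames
        d = self.dyn_attn(self.dyn_proj(dyn_feats))
        s = self.sta_attn(self.sta_proj(sta_feats))
        return self.regressor(torch.cat([d, s], dim=-1)).squeeze(-1)


if __name__ == "__main__":
    model = TwoStreamScorer()
    dyn = torch.randn(2, 68, 1024)            # e.g. 68 clips per long video
    sta = torch.randn(2, 40, 2048)            # e.g. 40 sampled athlete frames
    pred = model(dyn, sta)                    # predicted quality scores, shape (2,)
    loss = nn.MSELoss()(pred, torch.tensor([15.3, 17.1]))  # supervised by expert scores
    print(pred.shape, loss.item())
```

The sketch only mirrors the division of labor stated in the abstract: one branch pools clip-level motion features, the other pools posture features of detected athletes, and a regressor maps the concatenated stream features to a score supervised by expert annotations.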
Related papers
- Benchmarking Badminton Action Recognition with a New Fine-Grained Dataset [16.407837909069073]
We introduce the VideoBadminton dataset derived from high-quality badminton footage.
The introduction of VideoBadminton could not only serve for badminton action recognition but also provide a dataset for recognizing fine-grained actions.
arXiv Detail & Related papers (2024-03-19T02:52:06Z)
- Few-shot Action Recognition via Intra- and Inter-Video Information Maximization [28.31541961943443]
We propose a novel framework, Video Information Maximization (VIM), for few-shot action recognition.
VIM is equipped with an adaptive spatial-temporal video sampler and a temporal action alignment model.
VIM acts to maximize the distinctiveness of video information from limited video data.
arXiv Detail & Related papers (2023-05-10T13:05:43Z)
- Towards Active Learning for Action Spotting in Association Football Videos [59.84375958757395]
Analyzing football videos is challenging and requires identifying subtle and diverse spatio-temporal patterns.
Current algorithms face significant challenges when learning from limited annotated data.
We propose an active learning framework that selects the most informative video samples to be annotated next.
arXiv Detail & Related papers (2023-04-09T11:50:41Z)
- Sports Video Analysis on Large-Scale Data [10.24207108909385]
This paper investigates the modeling of automated machine description of sports videos.
We propose a novel large-scale NBA dataset for Sports Video Analysis (NSVA) with a focus on captioning.
arXiv Detail & Related papers (2022-08-09T16:59:24Z)
- Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance, outperforming existing methods with a 31.10% ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features (a generic sketch of such co-attention fusion follows this list).
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions [81.88294320397826]
A system does not know what human-object interactions are present in a video or the actual locations of the human and object.
We introduce a dataset comprising over 6.5k videos with human-object interactions that have been curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement [44.228748086927375]
We introduce the video-based object-oriented video captioning network (OVC-Net) via temporal graph and detail enhancement.
To demonstrate its effectiveness, we conduct experiments on the new dataset and compare our approach with state-of-the-art video captioning methods.
arXiv Detail & Related papers (2020-03-08T04:34:58Z)
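The co-attention fusion mentioned in the video salient object detection entry above can be sketched generically as follows. This is an assumption-based illustration of co-attention between low-level and high-level feature maps, not that paper's implementation; all class names, channel counts, and shapes are made up for the example.

```python
# Generic co-attention fusion of low-level (detail) and high-level (semantic)
# feature maps. Assumed names and shapes; not the cited paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionFusion(nn.Module):
    def __init__(self, low_ch: int, high_ch: int, dim: int = 256):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, dim, kernel_size=1)
        self.high_proj = nn.Conv2d(high_ch, dim, kernel_size=1)
        self.merge = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low:  (B, low_ch,  H, W) early-layer features with fine spatial detail
        # high: (B, high_ch, h, w) deep features with semantic context
        low_f = self.low_proj(low)
        high_f = F.interpolate(self.high_proj(high), size=low_f.shape[-2:],
                               mode="bilinear", align_corners=False)
        b, d, h, w = low_f.shape
        l = low_f.flatten(2)                            # (B, D, HW)
        g = high_f.flatten(2)                           # (B, D, HW)
        sim = l.transpose(1, 2) @ g / d ** 0.5          # (B, HW, HW) pairwise affinity
        high_ctx = g @ torch.softmax(sim, dim=-1).transpose(1, 2)  # high-level context at low positions
        low_ctx = l @ torch.softmax(sim, dim=1)                    # low-level context at high positions
        fused = torch.cat([low_f + high_ctx.view(b, d, h, w),
                           high_f + low_ctx.view(b, d, h, w)], dim=1)
        return self.merge(fused)                        # (B, dim, H, W) fused features


if __name__ == "__main__":
    fuse = CoAttentionFusion(low_ch=64, high_ch=512)
    out = fuse(torch.randn(1, 64, 28, 28), torch.randn(1, 512, 7, 7))
    print(out.shape)  # torch.Size([1, 256, 28, 28])
```

The sketch pairs every spatial position of the low-level map with every position of the upsampled high-level map, so each branch is enriched with context from the other before the two are merged.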