TSA-Net: Tube Self-Attention Network for Action Quality Assessment
- URL: http://arxiv.org/abs/2201.03746v1
- Date: Tue, 11 Jan 2022 02:25:27 GMT
- Title: TSA-Net: Tube Self-Attention Network for Action Quality Assessment
- Authors: Shunli Wang, Dingkang Yang, Peng Zhai, Chixiao Chen, Lihua Zhang
- Abstract summary: We propose a Tube Self-Attention Network (TSA-Net) for action quality assessment (AQA).
TSA-Net has the following merits: 1) high computational efficiency, 2) high flexibility, and 3) state-of-the-art performance.
- Score: 4.220843694492582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, assessing action quality from videos has attracted growing
attention in the computer vision and human-computer interaction communities. Most
existing approaches tackle this problem by directly migrating models from action
recognition tasks, which ignores the intrinsic differences within the feature map, such
as foreground and background information. To address this
issue, we propose a Tube Self-Attention Network (TSA-Net) for action quality
assessment (AQA). Specifically, we introduce a single object tracker into AQA
and propose the Tube Self-Attention Module (TSA), which can efficiently
generate rich spatio-temporal contextual information by adopting sparse feature
interactions. The TSA module is embedded in existing video networks to form
TSA-Net. Overall, TSA-Net has the following merits: 1) high computational efficiency,
2) high flexibility, and 3) state-of-the-art performance. Extensive experiments are
conducted on popular action quality assessment datasets, including AQA-7 and MTL-AQA.
In addition, a dataset named Fall Recognition in Figure Skating (FR-FS) is proposed to
explore basic action assessment in the figure skating scene.
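The core mechanism described in the abstract, self-attention restricted to the positions inside a tracker-generated spatio-temporal tube, can be illustrated with a short sketch. The PyTorch module below is an illustrative approximation, not the authors' implementation: the clip feature shape (T, C, H, W), the boolean tube_mask assumed to come from an off-the-shelf single-object tracker, and the single-head attention layout are all assumptions made for clarity.

```python
# Minimal sketch of tube-restricted self-attention (illustrative, not the authors' code).
# Assumes `feats` is a clip feature map of shape (T, C, H, W) and `tube_mask` is a
# boolean mask of shape (T, H, W) marking positions inside the tracked tube.
import torch
import torch.nn as nn


class TubeSelfAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv1d(channels, inner, kernel_size=1)
        self.key = nn.Conv1d(channels, inner, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor, tube_mask: torch.Tensor) -> torch.Tensor:
        T, C, H, W = feats.shape
        flat = feats.permute(1, 0, 2, 3).reshape(C, -1)        # (C, T*H*W)
        idx = tube_mask.reshape(-1).nonzero(as_tuple=True)[0]  # indices of tube positions only
        tube = flat[:, idx].unsqueeze(0)                       # (1, C, N) sparse set of features

        q = self.query(tube)                                   # (1, C', N)
        k = self.key(tube)                                     # (1, C', N)
        v = self.value(tube)                                   # (1, C, N)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)    # (1, N, N) attention among tube positions
        ctx = v @ attn.transpose(1, 2)                         # (1, C, N) aggregated context

        out = flat.clone()
        out[:, idx] = out[:, idx] + self.out(ctx).squeeze(0)   # residual update inside the tube only
        return out.reshape(C, T, H, W).permute(1, 0, 2, 3)
```

Because attention is computed only over the N tube positions rather than all T*H*W locations, the cost of the feature interaction scales with the tube size, which is the source of the efficiency claim; in the full model the module would sit between stages of an existing video backbone, but the sketch shows only the sparse attention step itself.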
Related papers
- GAIA: Rethinking Action Quality Assessment for AI-Generated Videos [56.047773400426486]
Action quality assessment (AQA) algorithms predominantly focus on actions in specific real-world scenarios and are pre-trained with normative action features.
We construct GAIA, a Generic AI-generated Action dataset, by conducting a large-scale subjective evaluation from a novel causal reasoning-based perspective.
Results show that traditional AQA methods, action-related metrics in recent T2V benchmarks, and mainstream video quality methods perform poorly with an average SRCC of 0.454, 0.191, and 0.519, respectively.
arXiv Detail & Related papers (2024-06-10T08:18:07Z) - UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment [23.48816491333345]
Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal.
Existing methods typically address these tasks independently due to distinct learning objectives.
We propose Unified vision-language pre-training of Quality and Aesthetics (UniQA) to learn general perceptions of two tasks, thereby benefiting them simultaneously.
arXiv Detail & Related papers (2024-06-03T07:40:10Z) - LOGO: A Long-Form Video Dataset for Group Action Quality Assessment [63.53109605625047]
We construct a new multi-person long-form video dataset for action quality assessment named LOGO.
Our dataset contains 200 videos from 26 artistic swimming events, with 8 athletes in each sample and an average duration of 204.2 seconds.
As for richness in annotations, LOGO includes formation labels to depict group information of multiple athletes and detailed annotations on action procedures.
arXiv Detail & Related papers (2024-04-07T17:51:53Z) - Continual Action Assessment via Task-Consistent Score-Discriminative Feature Distribution Modeling [31.696222064667243]
Action Quality Assessment (AQA) aims to answer how well an action is performed.
Existing works on AQA assume that all training data are available at once and do not support continual learning.
We propose a unified model to learn AQA tasks sequentially without forgetting.
arXiv Detail & Related papers (2023-09-29T10:06:28Z) - A Weak Supervision Approach for Few-Shot Aspect Based Sentiment [39.33888584498155]
Weak supervision on abundant unlabeled data can be leveraged to improve few-shot performance in sentiment analysis tasks.
We propose a pipeline approach to construct a noisy ABSA dataset, and we use it to adapt a pre-trained sequence-to-sequence model to the ABSA tasks.
Our proposed method preserves the full fine-tuning performance while showing significant improvements (15.84% absolute F1) in the few-shot learning scenario.
arXiv Detail & Related papers (2023-05-19T19:53:54Z) - Assessor360: Multi-sequence Network for Blind Omnidirectional Image
Quality Assessment [50.82681686110528]
Blind Omnidirectional Image Quality Assessment (BOIQA) aims to objectively assess the human perceptual quality of omnidirectional images (ODIs).
The quality assessment of ODIs is severely hampered by the fact that the existing BOIQA pipeline lacks the modeling of the observer's browsing process.
We propose a novel multi-sequence network for BOIQA called Assessor360, which is derived from the realistic multi-assessor ODI quality assessment procedure.
arXiv Detail & Related papers (2023-05-18T13:55:28Z) - CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised
Video Anomaly Detection [3.146076597280736]
Video anomaly detection (VAD) is a challenging problem in video surveillance where the frames of anomaly need to be localized in an untrimmed video.
We first propose to utilize ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations.
Our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly-used benchmark datasets in the VAD problem.
arXiv Detail & Related papers (2022-12-09T22:28:24Z) - Actor-identified Spatiotemporal Action Detection -- Detecting Who Is
Doing What in Videos [29.5205455437899]
Temporal Action Detection (TAD) has been investigated for estimating the start and end time for each action in videos.
Spatiotemporal Action Detection (SAD) has been studied for localizing the action both spatially and temporally in videos.
We propose a novel task, Actor-identified Spatiotemporal Action Detection (ASAD), to bridge the gap between SAD and actor identification.
arXiv Detail & Related papers (2022-08-27T06:51:12Z) - Video Action Detection: Analysing Limitations and Challenges [70.01260415234127]
We analyze existing datasets on video action detection and discuss their limitations.
We perform a bias study that analyzes a key property differentiating videos from static images: the temporal aspect.
Such extreme experiments show the existence of biases that have crept into existing methods in spite of careful modeling.
arXiv Detail & Related papers (2022-04-17T00:42:14Z) - Found a Reason for me? Weakly-supervised Grounded Visual Question
Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has recently seen increased attention in the research community.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z) - Mining Implicit Relevance Feedback from User Behavior for Web Question
Answering [92.45607094299181]
We make the first study to explore the correlation between user behavior and passage relevance.
Our approach significantly improves the accuracy of passage ranking without extra human labeled data.
In practice, this work has proved effective to substantially reduce the human labeling cost for the QA service in a global commercial search engine.
arXiv Detail & Related papers (2020-06-13T07:02:08Z)