TransRAC: Encoding Multi-scale Temporal Correlation with Transformers
for Repetitive Action Counting
- URL: http://arxiv.org/abs/2204.01018v1
- Date: Sun, 3 Apr 2022 07:50:18 GMT
- Title: TransRAC: Encoding Multi-scale Temporal Correlation with Transformers
for Repetitive Action Counting
- Authors: Huazhang Hu, Sixun Dong, Yiqun Zhao, Dongze Lian, Zhengxin Li,
Shenghua Gao
- Abstract summary: Existing methods focus on performing repetitive action counting in short videos.
We introduce a new large-scale repetitive action counting dataset covering a wide variety of video lengths.
With the help of fine-grained annotation of action cycles, we propose a density map regression-based method to predict the action period.
- Score: 30.541542156648894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Repetitive actions are widely seen in human activities such as
physical exercise. Existing methods focus on performing repetitive action
counting in short videos, which makes it difficult to handle the longer videos
found in more realistic scenarios. In the data-driven era, this degradation of
generalization capability is mainly attributed to the lack of long-video
datasets. To fill this gap, we introduce a new large-scale repetitive
action counting dataset covering a wide variety of video lengths, along with
more realistic situations where action interruptions or action inconsistencies
occur in the video. Besides, we also provide fine-grained annotations of the
action cycles instead of just a single numerical count annotation.
Such a dataset contains 1,451 videos with about 20,000 annotations, which is
more challenging. For repetitive action counting towards more realistic
scenarios, we further propose encoding multi-scale temporal correlation with
transformers that can take into account both performance and efficiency.
Furthermore, with the help of fine-grained annotation of action cycles, we
propose a density map regression-based method to predict the action period,
which yields better performance with sufficient interpretability. Our proposed
method outperforms state-of-the-art methods on all datasets and also achieves
better performance on the unseen dataset without fine-tuning. The dataset and
code are available.
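The density-map formulation described above can be sketched in a few lines: a model regresses a per-frame density map whose integral equals the repetition count, so counting reduces to summing the map. This is a minimal, hypothetical illustration of the idea, not TransRAC's actual implementation; the function name and values below are made up.

```python
import numpy as np

def count_from_density_map(density: np.ndarray) -> float:
    """Predicted repetition count is the integral (sum) of the density map."""
    return float(density.sum())

# Toy example: a 12-frame clip with two action cycles,
# each cycle's unit mass spread over 4 frames.
density = np.array([0.0, 0.25, 0.25, 0.25, 0.25, 0.0,
                    0.0, 0.25, 0.25, 0.25, 0.25, 0.0])
count = count_from_density_map(density)  # 2.0
```

Because the supervision is a per-frame density rather than a single scalar count, the prediction also localizes where each cycle occurs, which is the source of the interpretability the abstract mentions.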
Related papers
- Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting [87.11995635760108]
Key to action counting is accurately locating each video's repetitive actions.
We propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner.
arXiv Detail & Related papers (2024-06-13T05:15:52Z)
- Efficient Action Counting with Dynamic Queries [31.833468477101604]
We introduce a novel approach that employs an action query representation to localize repeated action cycles with linear computational complexity.
Unlike static action queries, this approach dynamically embeds video features into action queries, offering a more flexible and generalizable representation.
Our method significantly outperforms previous works, particularly in terms of long video sequences, unseen actions, and actions at various speeds.
arXiv Detail & Related papers (2024-03-03T15:43:11Z)
- Full Resolution Repetition Counting [19.676724611655914]
Given an untrimmed video, repetitive action counting aims to estimate the number of repetitions of class-agnostic actions.
Down-sampling is commonly used in recent state-of-the-art methods, causing several repetitive samples to be ignored.
In this paper, we attempt to understand repetitive actions from a full temporal resolution view, by combining offline feature extraction and temporal convolution networks.
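The full-resolution idea can be illustrated with a toy "same"-padded temporal convolution: the output keeps one value per input frame, so no repetition peaks are lost to temporal down-sampling. This is a hedged sketch with made-up weights, not the paper's actual network.

```python
import numpy as np

def temporal_conv(frame_feats: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Apply a 1D temporal convolution over a (T,) per-frame signal.

    'same' padding keeps one output value per input frame, preserving
    the full temporal resolution of the video.
    """
    return np.convolve(frame_feats, kernel, mode="same")

feats = np.zeros(10)
feats[3] = feats[7] = 1.0                # two repetition peaks
smoothed = temporal_conv(feats, np.array([0.25, 0.5, 0.25]))
assert smoothed.shape == feats.shape     # full temporal resolution preserved
```

Had the frames been down-sampled before this step, one of the two peaks could fall between sampled frames and be missed, which is exactly the failure mode the paper's full-resolution view avoids.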
arXiv Detail & Related papers (2023-05-23T07:45:56Z)
- Boundary-Denoising for Video Activity Localization [57.9973253014712]
We study the video activity localization problem from a denoising perspective.
Specifically, we propose an encoder-decoder model named DenoiseLoc.
Experiments show that DenoiseLoc advances performance in several video activity understanding tasks.
arXiv Detail & Related papers (2023-04-06T08:48:01Z)
- TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z)
- Distill and Collect for Semi-Supervised Temporal Action Segmentation [0.0]
We propose an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences.
Our approach uses multi-stream distillation that repeatedly refines and finally combines their frame predictions.
Our model also predicts the action order, which is later used as a temporal constraint while estimating frame labels to counter the lack of supervision for unannotated videos.
arXiv Detail & Related papers (2022-11-02T17:34:04Z)
- Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets, Kinetics-400, Kinetics-700, Moments-in-Time, Activitynet and Something-something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z)
- Temporal Action Localization with Multi-temporal Scales [54.69057924183867]
We propose to predict actions on a feature space of multi-temporal scales.
Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales.
The proposed method can achieve improvements of 12.6%, 17.4% and 2.2%, respectively.
arXiv Detail & Related papers (2022-08-16T01:48:23Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.