TransRAC: Encoding Multi-scale Temporal Correlation with Transformers
for Repetitive Action Counting
- URL: http://arxiv.org/abs/2204.01018v1
- Date: Sun, 3 Apr 2022 07:50:18 GMT
- Title: TransRAC: Encoding Multi-scale Temporal Correlation with Transformers
for Repetitive Action Counting
- Authors: Huazhang Hu, Sixun Dong, Yiqun Zhao, Dongze Lian, Zhengxin Li,
Shenghua Gao
- Abstract summary: Existing methods focus on performing repetitive action counting in short videos.
We introduce a new large-scale repetitive action counting dataset covering a wide variety of video lengths.
With the help of fine-grained annotation of action cycles, we propose a density map regression-based method to predict the action period.
- Score: 30.541542156648894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Repetitive actions are widely seen in human activities such as
physical exercise. Existing methods focus on repetitive action counting in
short videos and struggle with the longer videos found in more realistic
scenarios. In the data-driven era, this loss of generalization capability is
mainly attributed to the lack of long-video datasets. To fill this gap, we
introduce a new large-scale repetitive action counting dataset covering a wide
variety of video lengths, along with more realistic situations in which action
interruptions or action inconsistencies occur. Moreover, we provide
fine-grained annotations of the action cycles rather than a single numerical
count per video. The dataset contains 1,451 videos with about 20,000
annotations, making it considerably more challenging. For repetitive action
counting in these more realistic scenarios, we further propose encoding
multi-scale temporal correlation with transformers, which accounts for both
performance and efficiency. Furthermore, with the help of the fine-grained
cycle annotations, we propose a density map regression-based method to predict
the action period, which yields better performance with sufficient
interpretability. Our proposed method outperforms state-of-the-art methods on
all datasets and also achieves better performance on an unseen dataset without
fine-tuning. The dataset and code are available.
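To make the abstract's two ideas concrete, here is a minimal, self-contained PyTorch sketch (not the authors' released code): it builds temporal self-similarity matrices at several scales as the correlation encoding, passes them through a small transformer, and regresses a per-frame density map whose sum is the predicted repetition count. All module names, dimensions, scales, and the fusion scheme are illustrative assumptions.

```python
# Illustrative sketch only: multi-scale temporal self-similarity +
# transformer + density-map regression for repetition counting.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRepetitionCounter(nn.Module):
    def __init__(self, feat_dim=512, d_model=128, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Linear(feat_dim, d_model)
        # One correlation channel per scale; a small conv mixes them.
        self.corr_mix = nn.Conv2d(len(scales), d_model, kernel_size=3, padding=1)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Per-frame density head; summing its output yields the count.
        self.density_head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(),
                                          nn.Linear(64, 1), nn.Softplus())

    def forward(self, frame_feats):                        # (B, T, feat_dim)
        x = self.proj(frame_feats)                         # (B, T, d_model)
        B, T, D = x.shape
        corrs = []
        for s in self.scales:
            # Average-pool over time for a coarser temporal scale, then
            # upsample its self-similarity matrix back to T x T.
            xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s,
                              stride=s).transpose(1, 2)    # (B, T//s, D)
            sim = torch.matmul(xs, xs.transpose(1, 2)) / D ** 0.5
            sim = F.interpolate(sim.unsqueeze(1), size=(T, T),
                                mode='bilinear', align_corners=False)
            corrs.append(sim)                              # each (B, 1, T, T)
        corr = self.corr_mix(torch.cat(corrs, dim=1))      # (B, d_model, T, T)
        tokens = corr.mean(dim=3).transpose(1, 2)          # (B, T, d_model)
        tokens = self.encoder(tokens)                      # temporal transformer
        density = self.density_head(tokens).squeeze(-1)    # (B, T), >= 0
        count = density.sum(dim=1)                         # predicted repetitions
        return density, count

feats = torch.randn(2, 64, 512)        # e.g., frozen backbone features
density, count = ToyRepetitionCounter()(feats)
print(density.shape, count.shape)      # torch.Size([2, 64]) torch.Size([2])
```

In a setup like this, training would regress the predicted map against ground-truth density maps derived from the fine-grained cycle annotations (e.g., Gaussians centered on each annotated cycle), so the count falls out as the integral of the map rather than a single opaque scalar.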
Related papers
- Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions [59.71751978599567]
This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process.
We demonstrate significant reductions in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in the clicks required to annotate over 12 hours of video.
arXiv Detail & Related papers (2024-09-16T18:15:38Z)
- FMI-TAL: Few-shot Multiple Instances Temporal Action Localization by Probability Distribution Learning and Interval Cluster Refinement [2.261014973523156]
We propose a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement.
This method can accurately identify the start and end boundaries of actions in the query video.
Our model achieves competitive performance through meticulous experimentation utilizing the benchmark datasets ActivityNet1.3 and THUMOS14.
arXiv Detail & Related papers (2024-08-25T08:17:25Z)
- Efficient Action Counting with Dynamic Queries [31.833468477101604]
We introduce a novel approach that employs an action query representation to localize repeated action cycles with linear computational complexity.
Unlike static action queries, this approach dynamically embeds video features into action queries, offering a more flexible and generalizable representation.
Our method significantly outperforms previous works, particularly on long video sequences, unseen actions, and actions at various speeds.
arXiv Detail & Related papers (2024-03-03T15:43:11Z)
- Full Resolution Repetition Counting [19.676724611655914]
Given an untrimmed video, repetitive action counting aims to estimate the number of repetitions of class-agnostic actions.
Down-sampling is commonly used in recent state-of-the-art methods, causing several repetitions to be missed.
In this paper, we attempt to understand repetitive actions from a full temporal resolution view, by combining offline feature extraction and temporal convolution networks.
arXiv Detail & Related papers (2023-05-23T07:45:56Z)
- Boundary-Denoising for Video Activity Localization [57.9973253014712]
We study the video activity localization problem from a denoising perspective.
Specifically, we propose an encoder-decoder model named DenoiseLoc.
Experiments show that DenoiseLoc advances performance in several video activity understanding tasks.
arXiv Detail & Related papers (2023-04-06T08:48:01Z)
- Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z)
- Temporal Action Localization with Multi-temporal Scales [54.69057924183867]
We propose to predict actions on a feature space of multi-temporal scales.
Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales.
The proposed method can achieve improvements of 12.6%, 17.4% and 2.2%, respectively.
arXiv Detail & Related papers (2022-08-16T01:48:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.