IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting
- URL: http://arxiv.org/abs/2403.11959v2
- Date: Wed, 20 Mar 2024 11:58:23 GMT
- Title: IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting
- Authors: Hang Wang, Zhi-Qi Cheng, Youtian Du, Lei Zhang
- Abstract summary: Video Action Counting (VAC) is crucial in analyzing repetitive actions in videos.
Traditional methods have overlooked the complexity of action repetitions, such as interruptions and the variability in cycle duration.
We introduce Irregular Video Action Counting (IVAC), which prioritizes modeling irregular repetition patterns in videos.
- Score: 24.596979713593765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Action Counting (VAC) is crucial in analyzing sports, fitness, and everyday activities by quantifying repetitive actions in videos. However, traditional VAC methods have overlooked the complexity of action repetitions, such as interruptions and the variability in cycle duration. Our research addresses this shortfall by introducing a novel approach to VAC, called Irregular Video Action Counting (IVAC). IVAC prioritizes modeling irregular repetition patterns in videos, which we define through two primary aspects: Inter-cycle Consistency and Cycle-interval Inconsistency. Inter-cycle Consistency ensures homogeneity in the spatial-temporal representations of cycle segments, signifying action uniformity within cycles. Cycle-interval Inconsistency highlights the importance of distinguishing between cycle segments and intervals based on their inherent content differences. To encapsulate these principles, we propose a new methodology that includes consistency and inconsistency modules, supported by a unique pull-push loss (P2L) mechanism. The IVAC-P2L model applies a pull loss to promote coherence among cycle segment features and a push loss to clearly distinguish features of cycle segments from interval segments. Empirical evaluations conducted on the RepCount dataset demonstrate that the IVAC-P2L model sets a new benchmark in VAC task performance. Furthermore, the model demonstrates exceptional adaptability and generalization across various video contents, outperforming existing models on two additional datasets, UCFRep and Countix, without the need for dataset-specific optimization. These results confirm the efficacy of our approach in addressing irregular repetitions in videos and pave the way for further advancements in video analysis and understanding.
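The abstract describes the pull-push loss (P2L) only at a high level. Below is a minimal, illustrative sketch of how such a pull-push objective over segment features could look; the function name, the use of cosine similarity against a mean cycle feature, and the margin value are assumptions made for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def pull_push_loss(cycle_feats: torch.Tensor,
                   interval_feats: torch.Tensor,
                   margin: float = 0.5) -> torch.Tensor:
    """Illustrative pull-push (P2L) style objective (not the paper's exact loss).

    cycle_feats:    (N, D) pooled features of the N cycle segments
    interval_feats: (M, D) pooled features of the M interval segments
    """
    cycle = F.normalize(cycle_feats, dim=-1)
    interval = F.normalize(interval_feats, dim=-1)

    # Pull: draw every cycle-segment feature toward the mean cycle feature,
    # encouraging inter-cycle consistency.
    center = cycle.mean(dim=0, keepdim=True)                  # (1, D)
    pull = (1.0 - (cycle * center).sum(dim=-1)).mean()

    # Push: penalize interval segments whose cosine similarity to the cycle
    # center exceeds the margin, enforcing cycle-interval inconsistency.
    if interval_feats.numel() > 0:
        push = F.relu((interval * center).sum(dim=-1) - margin).mean()
    else:
        push = cycle.new_zeros(())

    return pull + push
```

For example, `pull_push_loss(torch.randn(4, 256), torch.randn(2, 256))` returns a scalar that could be added to a counting loss; the actual IVAC-P2L implementation may pool, weight, and combine segment features differently.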
Related papers
- Localization-Aware Multi-Scale Representation Learning for Repetitive Action Counting [19.546761142820376]
Repetitive action counting (RAC) aims to estimate the number of class-agnostic action occurrences in a video without exemplars.
Most current RAC methods rely on a raw frame-to-frame similarity representation for period prediction.
We introduce a foreground localization objective into similarity representation learning to obtain more robust and efficient video features.
arXiv Detail & Related papers (2025-01-13T13:24:41Z)
- Improving Weakly-supervised Video Instance Segmentation by Leveraging Spatio-temporal Consistency [9.115508086522887]
We introduce a weakly-supervised method called Eigen VIS that achieves competitive accuracy compared to other VIS approaches.
This method is based on two key innovations: a Temporal Eigenvalue Loss (TEL) and a clip-level Quality Coefficient (QCC).
The code is available on https://github.com/farnooshar/EigenVIS.
arXiv Detail & Related papers (2024-08-29T16:05:05Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- CTVIS: Consistent Training for Online Video Instance Segmentation [62.957370691452844]
Discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS).
Recent online VIS methods leverage contrastive items (CIs) sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings.
We propose a simple yet effective training strategy, called Consistent Training for Online VIS (CTVIS), which aims to align the training and inference pipelines.
arXiv Detail & Related papers (2023-07-24T08:44:25Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation stems from the observation that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Deep Explicit Duration Switching Models for Time Series [84.33678003781908]
We propose a flexible model that is capable of identifying both state- and time-dependent switching dynamics.
State-dependent switching is enabled by a recurrent state-to-switch connection.
An explicit duration count variable is used to improve the time-dependent switching behavior.
arXiv Detail & Related papers (2021-10-26T17:35:21Z)
- Context-aware and Scale-insensitive Temporal Repetition Counting [60.40438811580856]
Temporal repetition counting aims to estimate the number of cycles of a given repetitive action.
Existing deep learning methods assume repetitive actions are performed at a fixed time-scale, an assumption that does not hold for the complex repetitive actions found in real life.
We propose a context-aware and scale-insensitive framework to tackle the challenges in repetition counting caused by the unknown and diverse cycle-lengths.
arXiv Detail & Related papers (2020-05-18T05:49:48Z)