Related papers: Efficient Action Counting with Dynamic Queries

URL: http://arxiv.org/abs/2403.01543v3
Date: Sun, 9 Jun 2024 09:30:34 GMT
Title: Efficient Action Counting with Dynamic Queries
Authors: Zishi Li, Xiaoxuan Ma, Qiuyan Shang, Wentao Zhu, Hai Ci, Yu Qiao, Yizhou Wang
Abstract summary: We introduce a novel approach that employs an action query representation to localize repeated action cycles with linear computational complexity. Unlike static action queries, this approach dynamically embeds video features into action queries, offering a more flexible and generalizable representation. Our method significantly outperforms previous works, particularly in terms of long video sequences, unseen actions, and actions at various speeds.
Score: 31.833468477101604
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal repetition counting aims to quantify the repeated action cycles within a video. The majority of existing methods rely on the similarity correlation matrix to characterize the repetitiveness of actions, but their scalability is hindered due to the quadratic computational complexity. In this work, we introduce a novel approach that employs an action query representation to localize repeated action cycles with linear computational complexity. Based on this representation, we further develop two key components to tackle the essential challenges of temporal repetition counting. Firstly, to facilitate open-set action counting, we propose the dynamic update scheme on action queries. Unlike static action queries, this approach dynamically embeds video features into action queries, offering a more flexible and generalizable representation. Secondly, to distinguish between actions of interest and background noise actions, we incorporate inter-query contrastive learning to regularize the video representations corresponding to different action queries. As a result, our method significantly outperforms previous works, particularly in terms of long video sequences, unseen actions, and actions at various speeds. On the challenging RepCountA benchmark, we outperform the state-of-the-art method TransRAC by 26.5% in OBO accuracy, with a 22.7% mean error decrease and 94.1% computational burden reduction. Code is available at https://github.com/lizishi/DeTRC.
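The abstract reports results as OBO (off-by-one) accuracy and mean error. As a rough illustration of how these two metrics are commonly computed in the repetition counting literature, here is a minimal Python sketch. The function names and the `eps` smoothing term are assumptions made for this example and are not taken from the DeTRC code; the exact normalization used by the authors may differ.

```python
from typing import Sequence


def obo_accuracy(pred: Sequence[float], true: Sequence[float]) -> float:
    """Fraction of videos whose predicted count is within one of the true count."""
    if len(pred) != len(true) or len(pred) == 0:
        raise ValueError("pred and true must be non-empty and of equal length")
    hits = sum(1 for p, t in zip(pred, true) if abs(p - t) <= 1)
    return hits / len(pred)


def normalized_mae(pred: Sequence[float], true: Sequence[float],
                   eps: float = 0.1) -> float:
    """Mean of |pred - true| / (true + eps) over all videos.

    eps is a hypothetical smoothing term (an assumption of this sketch)
    that avoids division by zero when a video contains no repetitions.
    """
    if len(pred) != len(true) or len(pred) == 0:
        raise ValueError("pred and true must be non-empty and of equal length")
    errors = [abs(p - t) / (t + eps) for p, t in zip(pred, true)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    predicted = [4, 10, 7]   # made-up predicted counts for three videos
    actual = [5, 10, 10]     # made-up ground-truth counts
    print("OBO accuracy:", obo_accuracy(predicted, actual))
    print("Normalized MAE:", normalized_mae(predicted, actual))
```

Under these definitions, higher OBO accuracy and lower mean error are better, which matches the direction of the improvements quoted in the abstract.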

Related papers

Localization-Aware Multi-Scale Representation Learning for Repetitive Action Counting [19.546761142820376]
Repetitive action counting (RAC) aims to estimate the number of class-agnostic action occurrences in a video without exemplars. Most current RAC methods rely on a raw frame-to-frame similarity representation for period prediction. We introduce a foreground localization objective into similarity representation learning to obtain more robust and efficient video features.
arXiv Detail & Related papers (2025-01-13T13:24:41Z)
Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [51.38330727868982]
Bidirectional Decoding (BID) is a test-time inference algorithm that bridges action chunking with closed-loop operations. We show that BID boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z)
FMI-TAL: Few-shot Multiple Instances Temporal Action Localization by Probability Distribution Learning and Interval Cluster Refinement [2.261014973523156]
We propose a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement. This method can accurately identify the start and end boundaries of actions in the query video. Our model achieves competitive performance through meticulous experimentation utilizing the benchmark datasets ActivityNet1.3 and THUMOS14.
arXiv Detail & Related papers (2024-08-25T08:17:25Z)
FCA-RAC: First Cycle Annotated Repetitive Action Counting [30.253568218869237]
We propose a framework called First Cycle Annotated Repetitive Action Counting (FCA-RAC). FCA-RAC contains four parts, the first of which is a labeling technique that annotates each training video with the start and end of the first action cycle, along with the total action count. This technique enables the model to capture the correlation between the initial action cycle and subsequent actions.
arXiv Detail & Related papers (2024-06-18T01:12:43Z)
Online Action Representation using Change Detection and Symbolic Programming [0.3937354192623676]
The proposed method employs a change detection algorithm to automatically segment action sequences. We show the effectiveness of this representation in the downstream task of class repetition detection. The results of the experiments demonstrate that, despite operating online, the proposed method performs better than or on par with the existing method.
arXiv Detail & Related papers (2024-05-19T10:31:59Z)
Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos. We develop a novel auxiliary task by decoupling these two types of features within a video snippet. We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting [30.541542156648894]
Existing methods focus on performing repetitive action counting in short videos. We introduce a new large-scale repetitive action counting dataset covering a wide variety of video lengths. With the help of fine-grained annotation of action cycles, we propose a density map regression-based method to predict the action period.
arXiv Detail & Related papers (2022-04-03T07:50:18Z)
End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video. Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR. Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding. We propose the first purely anchor-free temporal localization method. Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
Context-aware and Scale-insensitive Temporal Repetition Counting [60.40438811580856]
Temporal repetition counting aims to estimate the number of cycles of a given repetitive action. Existing deep learning methods assume repetitive actions are performed in a fixed time-scale, which is invalid for the complex repetitive actions in real life. We propose a context-aware and scale-insensitive framework to tackle the challenges in repetition counting caused by the unknown and diverse cycle-lengths.
arXiv Detail & Related papers (2020-05-18T05:49:48Z)
Inferring Temporal Compositions of Actions Using Probabilistic Automata [61.09176771931052]
We propose to express temporal compositions of actions as semantic regular expressions and derive an inference framework using probabilistic automata. Our approach is different from existing works that either predict long-range complex activities as unordered sets of atomic actions, or retrieve videos using natural language sentences.
arXiv Detail & Related papers (2020-04-28T00:15:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.