Related papers: Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

URL: http://arxiv.org/abs/2406.08814v1
Date: Thu, 13 Jun 2024 05:15:52 GMT
Title: Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting
Authors: Zhengqi Zhao, Xiaohu Huang, Hao Zhou, Kun Yao, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng
Abstract summary: Key to action counting is accurately locating each video's repetitive actions. We propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner.
Score: 87.11995635760108
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The key to action counting is accurately locating each video's repetitive actions. Instead of estimating the probability of each frame belonging to an action directly, we propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner. The model draws inspiration from empirical observations indicating that humans typically engage in coarse skimming of entire sequences to grasp the general action pattern initially, followed by a finer, frame-by-frame focus to determine if it aligns with the target action. Specifically, SkimFocusNet incorporates a skim branch and a focus branch. The skim branch scans the global contextual information throughout the sequence to identify potential target action for guidance. Subsequently, the focus branch utilizes the guidance to diligently identify repetitive actions using a long-short adaptive guidance (LSAG) block. Additionally, we have observed that videos in existing datasets often feature only one type of repetitive action, which inadequately represents real-world scenarios. To more accurately describe real-life situations, we establish the Multi-RepCount dataset, which includes videos containing multiple repetitive motions. On Multi-RepCount, our SkimFocusNet can perform specified action counting, that is, to enable counting a particular action type by referencing an exemplary video. This capability substantially exhibits the robustness of our method. Extensive experiments demonstrate that SkimFocusNet achieves state-of-the-art performances with significant improvements. We also conduct a thorough ablation study to evaluate the network components. The source code will be published upon acceptance.

Related papers

Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video? [1.1288535170985818]
We introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively.
arXiv Detail & Related papers (2025-12-02T14:57:17Z)
Leveraging Scene Context with Dual Networks for Sequential User Behavior Modeling [58.72480539725212]
We propose a novel Dual Sequence Prediction network (DSPnet) to capture the dynamic interests and interplay between scenes and items for future behavior prediction. DSPnet consists of two parallel networks dedicated to learning users' dynamic interests over items and scenes, and a sequence feature enhancement module to capture the interplay for enhanced future behavior prediction.
arXiv Detail & Related papers (2025-09-30T12:26:57Z)
Multi-level and Multi-modal Action Anticipation [12.921307214813357]
Action anticipation, the task of predicting future actions from partially observed videos, is crucial for advancing intelligent systems. We introduce Multi-level and Multi-modal Action Anticipation (m&m-Ant), a novel multi-modal action anticipation approach. Experiments on widely used datasets, including Breakfast, 50 Salads, and DARai, demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2025-06-03T02:39:33Z)
Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism. Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions. We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics. Br-Prompt achieves state-of-the-art on multiple benchmarks.
arXiv Detail & Related papers (2022-03-26T15:52:27Z)
Temporal Action Segmentation with High-level Complex Activity Labels [29.17792724210746]
We learn the action segments taking only the high-level activity labels as input. We propose a novel action discovery framework that automatically discovers constituent actions in videos.
arXiv Detail & Related papers (2021-08-15T09:50:42Z)
Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is a task to classify each frame in the video with an action label. In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos. We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network [32.90753137435032]
We propose a convolutional attentive adversarial network (CAAN) to build a deep summarizer in an unsupervised way. Specifically, the generator employs a fully convolutional sequence network to extract global representation of a video, and an attention-based network to output normalized importance scores. The results show the superiority of our proposed method against other state-of-the-art unsupervised approaches.
arXiv Detail & Related papers (2021-05-24T07:24:39Z)
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding. We propose the first purely anchor-free temporal localization method. Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting [22.86745487695168]
We propose a baseline based on multi-instance and multi-label learning. We propose a novel approach that uses sets of actions as representation instead of modeling individual action classes. We evaluate the proposed approach on the challenging dataset where the proposed approach outperforms the MIML baseline and is competitive to fully supervised approaches.
arXiv Detail & Related papers (2021-01-21T11:59:47Z)
Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos. We learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames. We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)
Revisiting Few-shot Activity Detection with Class Similarity Control [107.79338380065286]
We present a framework for few-shot temporal activity detection based on proposal regression. Our model is end-to-end trainable, takes into account the frame rate differences between few-shot activities and untrimmed test videos, and can benefit from additional few-shot examples.
arXiv Detail & Related papers (2020-03-31T22:02:38Z)
SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead. We propose a unified system called SF-Net to make use of such single-frame supervision. SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.