Graph Convolutional Module for Temporal Action Localization in Videos
- URL: http://arxiv.org/abs/2112.00302v1
- Date: Wed, 1 Dec 2021 06:36:59 GMT
- Title: Graph Convolutional Module for Temporal Action Localization in Videos
- Authors: Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou
Huang, Chuang Gan
- Abstract summary: We claim that the relations between action units play an important role in action localization.
A more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view on the context related to it.
We propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods.
- Score: 142.5947904572949
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Temporal action localization has long been researched in computer vision.
Existing state-of-the-art action localization methods divide each video into
multiple action units (i.e., proposals in two-stage methods and segments in
one-stage methods) and then perform action recognition/regression on each of
them individually, without explicitly exploiting their relations during
learning. In this paper, we claim that the relations between action units play
an important role in action localization, and a more powerful action detector
should not only capture the local content of each action unit but also allow a
wider field of view on the context related to it. To this end, we propose a
general graph convolutional module (GCM) that can be easily plugged into
existing action localization methods, including two-stage and one-stage
paradigms. To be specific, we first construct a graph, where each action unit
is represented as a node and the relation between two action units as an
edge. Here, we use two types of relations, one for capturing the temporal
connections between different action units, and the other one for
characterizing their semantic relationship. Particularly for the temporal
connections in two-stage methods, we further explore two different kinds of
edges, one connecting the overlapping action units and the other one connecting
surrounding but disjoint units. On the constructed graph, we then apply graph
convolutional networks (GCNs) to model the relations among different action
units, which yields more informative representations that enhance
action localization. Experimental results show that our GCM consistently
improves the performance of existing action localization methods, including
two-stage methods (e.g., CBR and R-C3D) and one-stage methods (e.g., D-SSAD),
verifying the generality and effectiveness of our GCM.
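The abstract describes the mechanism but gives no implementation. Below is a minimal PyTorch sketch of the idea, under illustrative assumptions (the IoU and similarity thresholds, feature sizes, and function names are hypothetical, not values from the paper): proposals become graph nodes, temporal edges connect overlapping proposals, semantic edges connect proposals with similar features, and a single graph convolution aggregates neighbor features.

```python
import torch
import torch.nn as nn


def temporal_iou(intervals: torch.Tensor) -> torch.Tensor:
    """Pairwise temporal IoU for proposals given as (start, end) rows."""
    starts, ends = intervals[:, 0], intervals[:, 1]
    inter = (torch.min(ends[:, None], ends[None, :])
             - torch.max(starts[:, None], starts[None, :])).clamp(min=0)
    union = (ends - starts)[:, None] + (ends - starts)[None, :] - inter
    return inter / union.clamp(min=1e-6)


def build_adjacency(feats, intervals, iou_thresh=0.3, sim_thresh=0.7):
    # Temporal edges: proposals whose spans overlap (IoU above a threshold).
    temporal = temporal_iou(intervals) > iou_thresh
    # Semantic edges: proposals with high cosine feature similarity.
    normed = nn.functional.normalize(feats, dim=1)
    semantic = normed @ normed.t() > sim_thresh
    adj = (temporal | semantic).float()
    # Add self-loops and row-normalize so each node averages over neighbors.
    adj = adj + torch.eye(adj.size(0))
    return adj / adj.sum(dim=1, keepdim=True)


class GCNLayer(nn.Module):
    """One graph convolution: aggregate neighbor features, then transform."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats, adj):
        return torch.relu(self.linear(adj @ feats))


# Toy usage: 5 proposals with 16-dim features and (start, end) times.
feats = torch.randn(5, 16)
intervals = torch.tensor([[0., 2.], [1., 3.], [4., 6.], [5., 7.], [10., 12.]])
adj = build_adjacency(feats, intervals)
enhanced = GCNLayer(16, 16)(feats, adj)  # relation-aware proposal features
```

In the paper's two-stage setting, such relation-aware features would replace the raw proposal features fed to the downstream classification and boundary-regression heads.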
Related papers
- JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling [8.463489896549161]
Two-stage video action detection (VAD) is a formidable task that involves localizing and classifying actions within the spatial and temporal dimensions of a video clip.
We propose a two-stage VAD framework called Joint Actor-scene context Relation modeling (JARViS).
JARViS consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention.
arXiv Detail & Related papers (2024-08-07T08:08:08Z)
- BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation [34.88225099758585]
Supervised action segmentation aims to partition a video into non-overlapping segments, each representing a different action.
Recent works apply transformers to perform temporal modeling at the frame level, which incurs a high computational cost.
We propose an efficient BI-level Temporal modeling framework that learns explicit action tokens to represent action segments.
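As a rough illustration of the token idea (not BIT's actual architecture; all names and sizes below are assumptions), a DETR-style decoder with a small set of learnable action tokens can summarize a long frame sequence with one round of cross-attention:

```python
import torch
import torch.nn as nn


class ActionTokenDecoder(nn.Module):
    """Hypothetical sketch: learnable action tokens query frame features once,
    so segment-level modeling costs O(num_tokens * num_frames) rather than
    frame-to-frame attention's O(num_frames ** 2)."""

    def __init__(self, feat_dim=256, num_tokens=20, num_classes=48):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                                batch_first=True)
        self.classify = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):  # (batch, num_frames, feat_dim)
        batch = frame_feats.size(0)
        queries = self.tokens.unsqueeze(0).expand(batch, -1, -1)
        # Each token attends over all frames and comes to represent a segment.
        segs, _ = self.cross_attn(queries, frame_feats, frame_feats)
        return self.classify(segs)  # per-token action logits


logits = ActionTokenDecoder()(torch.randn(2, 500, 256))  # shape (2, 20, 48)
```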
arXiv Detail & Related papers (2023-08-28T20:59:15Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve efficiency for spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
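A minimal sketch of what component (i) could look like, assuming segments are given as snippet-index ranges (the fixed sampling length and helper name are hypothetical, not from the paper):

```python
import torch
import torch.nn.functional as F


def dynamic_segment_sampling(snippet_feats, segments, num_samples=16):
    """Illustrative sketch: resample every candidate segment to a fixed
    number of snippets, so short actions are upsampled and contribute as
    many features as long ones.

    snippet_feats: (num_snippets, feat_dim) per-snippet video features
    segments: list of (start_idx, end_idx) snippet-index pairs
    """
    sampled = []
    for start, end in segments:
        seg = snippet_feats[start:end]      # (length, feat_dim)
        seg = seg.t().unsqueeze(0)          # (1, feat_dim, length)
        # Linear interpolation up- or down-samples to a common length.
        seg = F.interpolate(seg, size=num_samples, mode="linear",
                            align_corners=False)
        sampled.append(seg.squeeze(0).t())  # (num_samples, feat_dim)
    return torch.stack(sampled)             # (num_segments, num_samples, feat_dim)


feats = torch.randn(200, 128)               # e.g., 200 snippets
out = dynamic_segment_sampling(feats, [(3, 7), (50, 120)])
print(out.shape)                            # torch.Size([2, 16, 128])
```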
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
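A hypothetical sketch of the second step, under the assumption that "foreground" tokens are picked by activation norm (the paper's actual selection rule and layer sizes may differ):

```python
import torch
import torch.nn as nn


class ForegroundTokenAttention(nn.Module):
    """Illustrative sketch: rather than attending over every spatial token,
    pick the k tokens with the largest activation norm (a crude foreground
    proxy) and run self-attention only among those."""

    def __init__(self, dim=256, k=16, heads=4):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):  # (batch, num_tokens, dim)
        scores = tokens.norm(dim=-1)              # (batch, num_tokens)
        idx = scores.topk(self.k, dim=1).indices  # (batch, k)
        gather = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        selected = tokens.gather(1, gather)       # (batch, k, dim)
        out, _ = self.attn(selected, selected, selected)
        return out                                # refined foreground tokens


refined = ForegroundTokenAttention()(torch.randn(2, 196, 256))
```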
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Weakly supervised temporal action localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
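A minimal sketch of what a CME-style gate could look like (an assumption-laden illustration, not the paper's code): adjacent-frame differences are pooled into a per-frame channel descriptor, and a small bottleneck produces the channel-wise gate vector:

```python
import torch
import torch.nn as nn


class ChannelMotionGate(nn.Module):
    """Illustrative sketch of a channel-wise motion gate: the difference
    between adjacent frames is pooled into a channel descriptor, squeezed
    through a bottleneck, and turned into a sigmoid gate that re-weights
    the channels carrying motion."""

    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feats):  # (batch, time, channels, height, width)
        # Temporal difference approximates motion; pad so shapes match.
        diff = feats[:, 1:] - feats[:, :-1]
        diff = torch.cat([diff, diff[:, -1:]], dim=1)
        desc = diff.mean(dim=(3, 4))   # (batch, time, channels)
        gate = self.fc(desc)           # channel-wise gate vector per frame
        return feats * gate.unsqueeze(-1).unsqueeze(-1)


out = ChannelMotionGate()(torch.randn(2, 8, 64, 14, 14))
```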
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- Modeling Multi-Label Action Dependencies for Temporal Action Localization [53.53490517832068]
Real-world videos contain many complex actions with inherent relationships between action classes.
We propose an attention-based architecture that models these action relationships for the task of temporal action localization in untrimmed videos.
We show improved performance over state-of-the-art methods on multi-label action localization benchmarks.
arXiv Detail & Related papers (2021-03-04T13:37:28Z)
- Action Graphs: Weakly-supervised Action Localization with Graph Convolution Networks [25.342482374259017]
We present a method for weakly-supervised action localization based on graph convolutions.
Our method utilizes similarity graphs that encode appearance and motion, and pushes the state of the art on THUMOS '14, ActivityNet 1.2, and Charades for weakly supervised action localization.
arXiv Detail & Related papers (2020-02-04T18:21:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.