Actor-Context-Actor Relation Network for Spatio-Temporal Action
Localization
- URL: http://arxiv.org/abs/2006.07976v3
- Date: Tue, 20 Apr 2021 20:30:27 GMT
- Title: Actor-Context-Actor Relation Network for Spatio-Temporal Action
Localization
- Authors: Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, Hongsheng
Li
- Abstract summary: ACAR-Net builds upon a novel High-order Relation Reasoning Operator to enable indirect relation reasoning for spatio-temporal action localization.
Our method ranks first in the AVA-Kinetics action localization task of ActivityNet Challenge 2020.
- Score: 47.61419011906561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Localizing persons and recognizing their actions from videos is a challenging
task towards high-level video understanding. Recent advances have been achieved
by modeling direct pairwise relations between entities. In this paper, we take
one step further: we not only model direct relations between pairs of entities
but also take into account indirect higher-order relations established upon
multiple elements. We propose to explicitly model the Actor-Context-Actor Relation,
which is the relation between two actors based on their interactions with the
context. To this end, we design an Actor-Context-Actor Relation Network
(ACAR-Net) which builds upon a novel High-order Relation Reasoning Operator and
an Actor-Context Feature Bank to enable indirect relation reasoning for
spatio-temporal action localization. Experiments on AVA and UCF101-24 datasets
show the advantages of modeling actor-context-actor relations, and
visualization of attention maps further verifies that our model is capable of
finding relevant higher-order relations to support action detection. Notably,
our method ranks first in the AVA-Kinetics action localization task of
ActivityNet Challenge 2020, outperforming other entries by a significant
margin (+6.71 mAP). Training code and models will be available at
https://github.com/Siyu-C/ACAR-Net.
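To make the high-order reasoning concrete, here is a minimal PyTorch sketch of the actor-context-actor idea, assuming ROI-pooled actor features and a pooled spatio-temporal context map. The module name, dimensions, and pooling choices are illustrative assumptions rather than the authors' implementation; in particular, the real ACAR-Net also maintains an Actor-Context Feature Bank over long time spans, which is omitted here.
```python
import torch
import torch.nn as nn

class ActorContextActorSketch(nn.Module):
    """Illustrative sketch: first-order actor-context relations, then a
    second attention step that relates actors through shared context."""
    def __init__(self, dim=256):
        super().__init__()
        # First order: fuse each actor with every spatial context location.
        self.first_order = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Second order: attend across actors at each context location, so
        # two actors become related via the context they both interact with.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, actors, context):
        # actors: (N, dim) ROI-pooled actor features
        # context: (dim, H, W) spatio-temporal context feature map
        N, dim = actors.shape
        _, H, W = context.shape
        ctx = context.unsqueeze(0).expand(N, -1, -1, -1)       # (N, dim, H, W)
        act = actors[:, :, None, None].expand(-1, -1, H, W)    # (N, dim, H, W)
        rel1 = self.first_order(torch.cat([act, ctx], dim=1))  # (N, dim, H, W)
        # Treat each location as a batch element and attend across actors.
        tokens = rel1.flatten(2).permute(2, 0, 1)              # (H*W, N, dim)
        rel2, _ = self.attn(tokens, tokens, tokens)            # (H*W, N, dim)
        # Pool context locations into one high-order feature per actor.
        return rel2.mean(dim=0)                                # (N, dim)
```
In a full detector, this relational feature would typically be concatenated with the basic actor features before the per-actor action classifier.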
Related papers
- Action Class Relation Detection and Classification Across Multiple Video
Datasets [1.15520000056402]
We consider two new machine learning tasks: action class relation detection and classification.
We propose a unified model to predict relations between action classes, using language and visual information associated with classes.
Experimental results show that (i) pre-trained recent neural network models for texts and videos contribute to high predictive performance, (ii) the relation prediction based on action label texts is more accurate than based on videos, and (iii) a blending approach can further improve the predictive performance in some cases.
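As a rough sketch of the blending in point (iii), a late fusion of the two modalities' relation logits could look like the following; the blend weight, tensor shapes, and relation-type count are illustrative assumptions, not the paper's setup.
```python
import torch

def blend_relation_logits(text_logits, video_logits, w_text=0.7):
    # Hypothetical late fusion; the weight favors text, which the paper
    # reports as the stronger single modality for relation prediction.
    return w_text * text_logits + (1.0 - w_text) * video_logits

# e.g. logits for 8 action-class pairs over 4 hypothetical relation types
text_logits, video_logits = torch.randn(8, 4), torch.randn(8, 4)
relations = blend_relation_logits(text_logits, video_logits).argmax(dim=-1)
```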
arXiv Detail & Related papers (2023-08-15T03:56:46Z)
- MRSN: Multi-Relation Support Network for Video Action Detection [15.82531313330869]
Action detection is a challenging video understanding task that requires modeling relations.
We propose a novel network called the Multi-Relation Support Network (MRSN).
Our experiments demonstrate that modeling relations separately and performing relation-level interactions can achieve state-of-the-art results.
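A minimal sketch of "model relations separately, then interact at the relation level", assuming attention-based encoders; branch names and shapes are illustrative and not MRSN's actual architecture.
```python
import torch
import torch.nn as nn

class SeparateRelationsSketch(nn.Module):
    """Illustrative: encode actor-actor and actor-context relations in
    separate branches, then let the two relation features interact."""
    def __init__(self, dim=256):
        super().__init__()
        self.actor_actor = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.actor_context = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.relation_level = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, actors, context_tokens):
        # actors: (B, N, dim); context_tokens: (B, M, dim)
        aa, _ = self.actor_actor(actors, actors, actors)
        ac, _ = self.actor_context(actors, context_tokens, context_tokens)
        # Relation-level interaction: one relation attends to the other.
        fused, _ = self.relation_level(aa, ac, ac)
        return fused                                           # (B, N, dim)
```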
arXiv Detail & Related papers (2023-04-24T10:15:31Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD to improve efficiency for spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- CycleACR: Cycle Modeling of Actor-Context Relations for Video Action Detection [67.90338302559672]
We propose to select actor-related scene context, rather than directly leverage raw video scenario, to improve relation modeling.
We develop a Cycle Actor-Context Relation network (CycleACR) with a symmetric graph that models actor and context relations in a bidirectional form.
Compared to existing designs that focus on C2A-E, our CycleACR introduces A2C-R for more effective relation modeling.
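Read literally from the abstract, the cycle amounts to two attention passes, sketched below under the assumption that A2C-R lets actors reorganize (select) the context and C2A-E feeds the selected context back into the actors; this is an interpretation, not the paper's exact design.
```python
import torch
import torch.nn as nn

class CycleACRSketch(nn.Module):
    """Illustrative bidirectional actor<->context cycle."""
    def __init__(self, dim=256):
        super().__init__()
        self.a2c = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.c2a = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, actors, context_tokens):
        # actors: (B, N, dim); context_tokens: (B, M, dim) raw scene tokens
        # A2C: actors query the scene to pick out actor-related context.
        selected, _ = self.a2c(actors, context_tokens, context_tokens)
        # C2A: the selected context flows back to enhance the actors.
        enhanced, _ = self.c2a(actors, selected, selected)
        return enhanced                                        # (B, N, dim)
```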
arXiv Detail & Related papers (2023-03-28T16:40:47Z)
- Graph Convolutional Module for Temporal Action Localization in Videos [142.5947904572949]
We claim that the relations between action units play an important role in action localization.
A more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view on the context related to it.
We propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods.
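A toy version of such a plug-in module, assuming proposal-level features and a similarity-based graph; the single layer and softmax adjacency are simplifications, not the paper's GCM.
```python
import torch
import torch.nn as nn

class ProposalGraphConvSketch(nn.Module):
    """Illustrative graph convolution over temporal action proposals:
    edges connect proposals with similar features, widening each
    proposal's view beyond its own clip."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, proposals):
        # proposals: (N, dim) features of candidate action units
        affinity = proposals @ proposals.t()            # (N, N) edge weights
        adj = torch.softmax(affinity, dim=-1)           # row-normalized graph
        return torch.relu(self.proj(adj @ proposals))   # one GCN layer
```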
arXiv Detail & Related papers (2021-12-01T06:36:59Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
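A compact sketch of such a cascade, assuming a fixed stage count and shared refinement heads; shapes and names are illustrative, not the paper's architecture.
```python
import torch
import torch.nn as nn

class CascadeHOISketch(nn.Module):
    """Illustrative coarse-to-fine cascade: each stage refines instance
    features and hands them to an interaction-recognition head."""
    def __init__(self, dim=256, num_stages=3):  # stage count is assumed
        super().__init__()
        self.refine = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            for _ in range(num_stages))
        self.interact = nn.ModuleList(
            nn.Linear(2 * dim, dim) for _ in range(num_stages))

    def forward(self, human_feats, object_feats):
        # human_feats, object_feats: (P, dim) features for P HOI proposals
        outputs = []
        for refine, interact in zip(self.refine, self.interact):
            human_feats = refine(human_feats)       # progressive refinement
            object_feats = refine(object_feats)
            pair = torch.cat([human_feats, object_feats], dim=-1)
            outputs.append(interact(pair))          # per-stage interaction
        return outputs
```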
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.