AEI: Actors-Environment Interaction with Adaptive Attention for Temporal
Action Proposals Generation
- URL: http://arxiv.org/abs/2110.11474v2
- Date: Mon, 25 Oct 2021 01:12:47 GMT
- Title: AEI: Actors-Environment Interaction with Adaptive Attention for Temporal
Action Proposals Generation
- Authors: Khoa Vo, Hyekang Joo, Kashu Yamazaki, Sang Truong, Kris Kitani,
Minh-Triet Tran, Ngan Le
- Abstract summary: We propose the Actor Environment Interaction (AEI) network to improve video representation for temporal action proposal generation.
AEI contains two modules: a perception-based visual representation (PVR) module and a boundary-matching module (BMM).
- Score: 15.360689782405057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans typically perceive the occurrence of an action in a video through the interaction between an actor and the surrounding environment: an action starts only when the main actor begins to interact with the environment, and it ends when that interaction stops. Despite great progress in temporal action proposal generation, most existing works ignore this observation and leave their models to learn to propose actions as a black box. In this paper, we attempt to simulate this human ability by proposing the Actor Environment Interaction (AEI) network to improve video representation for temporal action proposal generation. AEI contains two modules: a perception-based visual representation (PVR) module and a boundary-matching module (BMM). PVR represents each video snippet by taking human-human and human-environment relations into account through the proposed adaptive attention mechanism; the resulting video representation is then fed to BMM to generate action proposals. AEI is comprehensively evaluated on the ActivityNet-1.3 and THUMOS-14 datasets for both temporal action proposal generation and temporal action detection, with two boundary-matching architectures (CNN-based and GCN-based) and two classifiers (Unet and P-GCN). AEI robustly outperforms state-of-the-art methods, showing strong performance and generalization on both tasks.
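
The adaptive-attention fusion of actor and environment cues described in the abstract can be illustrated with a minimal PyTorch sketch. The layer sizes, the sigmoid gating scheme, and the toy boundary head below are illustrative assumptions, not the authors' exact PVR/BMM design:

```python
import torch
import torch.nn as nn


class AdaptiveAttentionPVR(nn.Module):
    """Illustrative perception-based visual representation (PVR) module.

    Per snippet, self-attention over actor features models human-human
    relations, cross-attention to a global frame feature models
    human-environment relations, and a learned gate adaptively mixes the
    two sources into one snippet vector. Sizes and gating are assumptions.
    """

    def __init__(self, feat_dim: int = 512, num_heads: int = 4):
        super().__init__()
        self.actor_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.env_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, actor_feats: torch.Tensor, env_feat: torch.Tensor) -> torch.Tensor:
        # actor_feats: (N, A, D) features of A detected actors per snippet
        # env_feat:    (N, D)    whole-frame environment feature per snippet
        actors, _ = self.actor_attn(actor_feats, actor_feats, actor_feats)  # human-human
        env = env_feat.unsqueeze(1)                                         # (N, 1, D)
        actors, _ = self.env_attn(actors, env, env)                         # human-environment
        actor_summary = actors.mean(dim=1)                                  # pool over actors
        g = self.gate(torch.cat([actor_summary, env_feat], dim=-1))         # adaptive weight
        return g * actor_summary + (1.0 - g) * env_feat                     # snippet vector


class BoundaryMatchingHead(nn.Module):
    """Toy CNN-based boundary head: scores each snippet as an action start/end.
    The real BMM produces a full boundary-matching confidence map; this is a
    simplified stand-in."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.temporal = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.start_end = nn.Conv1d(feat_dim, 2, kernel_size=1)

    def forward(self, snippet_feats: torch.Tensor) -> torch.Tensor:
        # snippet_feats: (B, T, D) -> start/end probabilities (B, T, 2)
        x = torch.relu(self.temporal(snippet_feats.transpose(1, 2)))
        return torch.sigmoid(self.start_end(x)).transpose(1, 2)


if __name__ == "__main__":
    B, T, A, D = 2, 100, 5, 512                        # videos, snippets, actors, feature dim
    pvr, bmm = AdaptiveAttentionPVR(D), BoundaryMatchingHead(D)
    actor_feats = torch.randn(B * T, A, D)             # stand-in actor features
    env_feat = torch.randn(B * T, D)                    # stand-in environment features
    snippets = pvr(actor_feats, env_feat).view(B, T, D)
    print(bmm(snippets).shape)                          # torch.Size([2, 100, 2])
```

In this sketch, self-attention among actor features stands in for human-human relations, cross-attention to a global frame feature for human-environment relations, and a sigmoid gate adaptively weighs the two sources before the sequence of snippet vectors is scored for start and end boundaries.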
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos.
Faster-TAD simplifies the TAD pipeline and achieves remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z) - JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling [8.463489896549161]
Video action detection (VAD) is a formidable task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip.
We propose a two-stage VAD framework called Joint Actor-scene context Relation modeling (JARViS).
JARViS consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention.
arXiv Detail & Related papers (2024-08-07T08:08:08Z) - Collaboratively Self-supervised Video Representation Learning for Action
Recognition [58.195372471117615]
We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition.
Our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2024-01-15T10:42:04Z) - CycleACR: Cycle Modeling of Actor-Context Relations for Video Action
Detection [67.90338302559672]
We propose to select actor-related scene context, rather than directly leveraging the raw video scene, to improve relation modeling.
We develop a Cycle Actor-Context Relation network (CycleACR) with a symmetric graph that models actor and context relations in a bidirectional form.
Compared to existing designs that focus on C2A-E, our CycleACR introduces A2C-R for more effective relation modeling.
arXiv Detail & Related papers (2023-03-28T16:40:47Z) - AOE-Net: Entities Interactions Modeling with Adaptive Attention
Mechanism for Temporal Action Proposals Generation [24.81870045216019]
Temporal action proposal generation (TAPG) is a challenging task, which requires localizing action intervals in an untrimmed video.
We propose to model these interactions with a multi-modal representation network, namely the Actors-Objects-Environment Interaction Network (AOE-Net).
Our AOE-Net consists of two modules: a perception-based multi-modal representation (PMR) module and a boundary-matching module (BMM).
arXiv Detail & Related papers (2022-10-05T21:57:25Z) - E^2TAD: An Energy-Efficient Tracking-based Action Detector [78.90585878925545]
This paper presents a tracking-based solution to accurately and efficiently localize predefined key actions.
It won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).
arXiv Detail & Related papers (2022-04-09T07:52:11Z) - ABN: Agent-Aware Boundary Networks for Temporal Action Proposal
Generation [14.755186542366065]
Temporal action proposal generation (TAPG) aims to estimate temporal intervals of actions in untrimmed videos.
We propose a novel framework named Agent-Aware Boundary Network (ABN), which consists of two sub-networks.
We show that our proposed ABN robustly outperforms state-of-the-art methods regardless of the employed backbone network on TAPG.
arXiv Detail & Related papers (2022-03-16T21:06:34Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions only among a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - Agent-Environment Network for Temporal Action Proposal Generation [10.74737201306622]
Temporal action proposal generation aims at localizing temporal intervals containing human actions in untrimmed videos.
Based on the action definition that a human, known as an agent, interacts with the environment and performs an action that affects the environment, we propose a contextual Agent-Environment Network.
Our proposed contextual AEN involves (i) an agent pathway, operating at a local level to identify which humans/agents are acting, and (ii) an environment pathway, operating at a global level to capture how the agents interact with the environment.
arXiv Detail & Related papers (2021-07-17T23:24:49Z) - Augmented Transformer with Adaptive Graph for Temporal Action Proposal
Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z) - Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)