AOE-Net: Entities Interactions Modeling with Adaptive Attention
Mechanism for Temporal Action Proposals Generation
- URL: http://arxiv.org/abs/2210.02578v1
- Date: Wed, 5 Oct 2022 21:57:25 GMT
- Title: AOE-Net: Entities Interactions Modeling with Adaptive Attention
Mechanism for Temporal Action Proposals Generation
- Authors: Khoa Vo, Sang Truong, Kashu Yamazaki, Bhiksha Raj, Minh-Triet Tran,
Ngan Le
- Abstract summary: Temporal action proposal generation (TAPG) is a challenging task that requires localizing action intervals in an untrimmed video.
We propose to model the interactions among actors, relevant objects, and the surrounding environment with a multi-modal representation network, namely the Actors-Objects-Environment Interaction Network (AOE-Net).
Our AOE-Net consists of two modules: a perception-based multi-modal representation (PMR) module and a boundary-matching module (BMM).
- Score: 24.81870045216019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action proposal generation (TAPG) is a challenging task
that requires localizing action intervals in an untrimmed video. Intuitively,
we as humans perceive an action through the interactions between actors,
relevant objects, and the surrounding environment. Despite the significant
progress of TAPG, the vast majority of existing methods ignore this principle
of human perception by applying a backbone network to a given video as a
black box. In this paper, we propose to model these interactions with a
multi-modal representation network, namely the Actors-Objects-Environment
Interaction Network (AOE-Net). Our AOE-Net consists of two modules: a
perception-based multi-modal representation (PMR) module and a
boundary-matching module (BMM). Additionally, we introduce an adaptive
attention mechanism (AAM) in PMR that focuses only on the main actors (or
relevant objects) and models the relationships among them. The PMR module
represents each video snippet by a visual-linguistic feature, in which the
main actors and the surrounding environment are represented by visual
information, whereas relevant objects are depicted by linguistic features
obtained through an image-text model. The BMM module takes the sequence of
visual-linguistic features as its input and generates action proposals.
Comprehensive experiments and extensive ablation studies on the
ActivityNet-1.3 and THUMOS-14 datasets show that our proposed AOE-Net
outperforms previous state-of-the-art methods, with remarkable performance
and generalization on both TAPG and temporal action detection. To further
demonstrate the robustness and effectiveness of AOE-Net, we conduct an
additional ablation study on egocentric videos, i.e., the EPIC-KITCHENS-100
dataset. Source code is available upon acceptance.
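As a concrete illustration of the pipeline described in the abstract, below is
a minimal PyTorch sketch of how an adaptive attention mechanism could pool a
variable number of per-snippet entity features, and how a PMR-style fusion
could combine the environment, actor, and object streams into one
visual-linguistic snippet feature. This is not the authors' code; all class
names, dimensions, and design choices here are illustrative assumptions.

```python
# A minimal sketch (assumptions, not the authors' implementation) of the
# adaptive attention mechanism (AAM) and PMR fusion described above.
import torch
import torch.nn as nn


class AdaptiveAttention(nn.Module):
    """Attends over a variable-size set of entity features (actors or
    objects) and pools them into one snippet-level descriptor, so that
    only the most relevant entities dominate the representation."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # per-entity relevance logit

    def forward(self, feats, pad_mask=None):
        # feats: (B, N, D) entity features; pad_mask: (B, N), True = padding
        ctx, _ = self.self_attn(feats, feats, feats, key_padding_mask=pad_mask)
        logits = self.score(ctx)  # (B, N, 1)
        if pad_mask is not None:
            logits = logits.masked_fill(pad_mask.unsqueeze(-1), float("-inf"))
        weights = torch.softmax(logits, dim=1)  # adaptive entity weights
        return (weights * ctx).sum(dim=1)       # (B, D) pooled descriptor


class PMRFusion(nn.Module):
    """Fuses the environment feature (global visual), main-actor features
    (visual), and relevant-object features (linguistic, e.g. text embeddings
    from an image-text model) into one visual-linguistic snippet feature."""

    def __init__(self, dim: int):
        super().__init__()
        self.actor_attn = AdaptiveAttention(dim)
        self.object_attn = AdaptiveAttention(dim)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, env, actors, objects, actor_mask=None, object_mask=None):
        # env: (B, D); actors, objects: (B, N, D) per-snippet entity features
        a = self.actor_attn(actors, actor_mask)     # main actors
        o = self.object_attn(objects, object_mask)  # relevant objects
        return self.fuse(torch.cat([env, a, o], dim=-1))  # (B, D)
```

In the full model, the BMM would then consume the temporal sequence of these
fused snippet features and, in the spirit of boundary-matching approaches such
as BMN, score candidate start/end boundaries and proposal confidences.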
Related papers
- Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection [4.938957922033169]
Out-of-distribution (OOD) detection aims to detect and reject test samples with semantic shifts.
We propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN).
We show that UAAN outperforms state-of-the-art methods by a significant margin, illustrating its effectiveness.
arXiv Detail & Related papers (2024-09-16T02:53:49Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two real-world instance-level retrieval tasks.
We then train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- ABN: Agent-Aware Boundary Networks for Temporal Action Proposal Generation [14.755186542366065]
Temporal action proposal generation (TAPG) aims to estimate temporal intervals of actions in untrimmed videos.
We propose a novel framework named Agent-Aware Boundary Network (ABN), which consists of two sub-networks.
We show that our proposed ABN robustly outperforms state-of-the-art methods regardless of the employed backbone network on TAPG.
arXiv Detail & Related papers (2022-03-16T21:06:34Z)
- AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation [15.360689782405057]
We propose the Actors-Environment Interaction (AEI) network to improve video representation for temporal action proposal generation.
AEI contains two modules, i.e., a perception-based visual representation (PVR) module and a boundary-matching module (BMM).
arXiv Detail & Related papers (2021-10-21T20:43:42Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions among only a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Learning Long-term Visual Dynamics with Region Proposal Interaction Networks [75.06423516419862]
We build object representations that can capture inter-object and object-environment interactions over a long range.
Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin.
arXiv Detail & Related papers (2020-08-05T17:48:00Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)