Holistic Interaction Transformer Network for Action Detection
- URL: http://arxiv.org/abs/2210.12686v1
- Date: Sun, 23 Oct 2022 10:19:37 GMT
- Title: Holistic Interaction Transformer Network for Action Detection
- Authors: Gueter Josmy Faure, Min-Hung Chen, Shang-Hong Lai
- Abstract summary: The "HIT" network is a comprehensive bi-modal framework that comprises an RGB stream and a pose stream.
Our method significantly outperforms previous approaches on the J-HMDB, UCF101-24, and MultiSports datasets.
- Score: 15.667833703317124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Actions are about how we interact with the environment, including other
people, objects, and ourselves. In this paper, we propose a novel multi-modal
Holistic Interaction Transformer Network (HIT) that leverages the largely
ignored, but critical hand and pose information essential to most human
actions. The proposed "HIT" network is a comprehensive bi-modal framework that
comprises an RGB stream and a pose stream. Each of them separately models
person, object, and hand interactions. Within each sub-network, an
Intra-Modality Aggregation module (IMA) is introduced that selectively merges
individual interaction units. The resulting features from each modality are
then glued using an Attentive Fusion Mechanism (AFM). Finally, we extract cues
from the temporal context to better classify the occurring actions using cached
memory. Our method significantly outperforms previous approaches on the J-HMDB,
UCF101-24, and MultiSports datasets. We also achieve competitive results on
AVA. The code will be available at https://github.com/joslefaure/HIT.
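
To make the pipeline above easier to picture, here is a minimal, hypothetical PyTorch sketch of the described wiring: per-modality interaction units, an IMA-style gate that merges them, an AFM-style attentive fusion, and an attention read over cached memory features. The class names, tensor shapes, and layer choices are assumptions for illustration only; the authors' actual implementation is the code released at https://github.com/joslefaure/HIT.

```python
# Hypothetical sketch of the bi-modal layout described in the abstract; the
# official implementation is at https://github.com/joslefaure/HIT.
import torch
import torch.nn as nn


class InteractionUnit(nn.Module):
    """Person/object/hand interaction approximated as cross-attention (assumption)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, person, context):
        # Person features query this unit's context (person, object, or hand features).
        out, _ = self.attn(person, context, context)
        return out


class IntraModalityAggregation(nn.Module):
    """IMA stand-in: selectively merge the interaction units within one modality."""
    def __init__(self, dim: int, n_units: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim, n_units)

    def forward(self, units):                           # list of (B, N, D) tensors
        stacked = torch.stack(units, dim=-2)             # (B, N, n_units, D)
        weights = torch.softmax(self.gate(stacked.mean(-2)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(-2)  # gated merge -> (B, N, D)


class AttentiveFusion(nn.Module):
    """AFM stand-in: attentively fuse RGB-stream and pose-stream features."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, pose):
        fused, _ = self.attn(rgb, pose, pose)
        return fused + rgb


class HITSketch(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 80):
        super().__init__()
        self.rgb_units = nn.ModuleList(InteractionUnit(dim) for _ in range(3))
        self.pose_units = nn.ModuleList(InteractionUnit(dim) for _ in range(3))
        self.rgb_ima = IntraModalityAggregation(dim)
        self.pose_ima = IntraModalityAggregation(dim)
        self.afm = AttentiveFusion(dim)
        # Stand-in for the cached-memory temporal context mentioned in the abstract.
        self.temporal = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, person_rgb, rgb_ctx, person_pose, pose_ctx, memory):
        # rgb_ctx / pose_ctx: lists of three context tensors (person, object, hand).
        rgb = self.rgb_ima([u(person_rgb, c) for u, c in zip(self.rgb_units, rgb_ctx)])
        pose = self.pose_ima([u(person_pose, c) for u, c in zip(self.pose_units, pose_ctx)])
        fused = self.afm(rgb, pose)
        fused, _ = self.temporal(fused, memory, memory)  # read temporal context from memory
        return self.classifier(fused)
```

As a toy check, `HITSketch()(torch.randn(1, 2, 256), [torch.randn(1, 5, 256)] * 3, torch.randn(1, 2, 256), [torch.randn(1, 5, 256)] * 3, torch.randn(1, 8, 256))` returns per-person logits of shape `(1, 2, 80)`.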
Related papers
- HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects [86.86284624825356]
HIMO is a dataset of full-body humans interacting with multiple objects.
HIMO contains 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
arXiv Detail & Related papers (2024-07-17T07:47:34Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
Experimental results demonstrate that MPI achieves improvements of 10% to 64% over the previous state of the art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition [8.513434732050749]
We propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations.
Our network contains a tokenizer to partition the input into Interactive Spatiotemporal Tokens (ISTs), a unified way to represent the motions of multiple diverse entities.
To jointly learn along the three dimensions of ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations (a minimal sketch of this idea follows the related-papers list).
arXiv Detail & Related papers (2023-07-14T16:51:25Z)
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z)
- Interaction Region Visual Transformer for Egocentric Action Anticipation [18.873728614415946]
We propose a novel way to represent human-object interactions for egocentric action anticipation.
We model interactions between hands and objects using Spatial Cross-Attention.
We then infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens.
Using these tokens, we construct an interaction-centric video representation for action anticipation (a minimal cross-attention sketch follows the related-papers list).
arXiv Detail & Related papers (2022-11-25T15:00:51Z)
- AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation [24.81870045216019]
Temporal action proposal generation (TAPG) is a challenging task that requires localizing action intervals in an untrimmed video.
We propose to model actor, object, and environment interactions with a multi-modal representation network, namely the Actors-Objects-Environment Interaction Network (AOE-Net).
Our AOE-Net consists of two modules: a perception-based multi-modal representation (PMR) module and a boundary-matching module (BMM).
arXiv Detail & Related papers (2022-10-05T21:57:25Z)
- Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion [68.45737688496654]
We present a modular interactive VOS framework which decouples interaction-to-mask and mask propagation.
We show that our method outperforms current state-of-the-art algorithms while requiring fewer frame interactions.
arXiv Detail & Related papers (2021-03-14T14:39:08Z)
- Multi-scale Interactive Network for Salient Object Detection [91.43066633305662]
We propose aggregate interaction modules to integrate features from adjacent levels.
To obtain more efficient multi-scale features, self-interaction modules are embedded in each decoder unit.
Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-17T15:41:37Z)
- Asynchronous Interaction Aggregation for Action Detection [43.34864954534389]
We propose the Asynchronous Interaction Aggregation network (AIA), which leverages different interactions to boost action detection.
It has two key designs: the Interaction Aggregation structure (IA), which adopts a uniform paradigm to model and integrate multiple types of interaction, and the Asynchronous Memory Update algorithm (AMU), which achieves better performance by updating long-term memory features asynchronously.
arXiv Detail & Related papers (2020-04-16T07:03:20Z)
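
The ISTA-Net entry above describes tokens spanning time, joints, and entities, mixed by self-attention combined with 3D convolution. Below is a minimal, hypothetical sketch of that idea, not the authors' code; the tensor layout, layer sizes, and residual structure are assumptions.

```python
# Hypothetical sketch of the ISTA-Net idea: tokenize multi-entity skeleton
# motion along time, joints, and entities, then mix tokens with a 3D
# convolution plus multi-head self-attention. Shapes and layers are assumed.
import torch
import torch.nn as nn


class ISTBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Local mixing across (time, joint, entity) neighborhoods.
        self.conv3d = nn.Conv3d(dim, dim, kernel_size=3, padding=1)
        # Global mixing across all interactive spatiotemporal tokens.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, C, T, J, E)
        b, c, t, j, e = x.shape
        x = self.conv3d(x)                     # local 3D mixing, shape preserved
        tokens = x.flatten(2).transpose(1, 2)  # (B, T*J*E, C) token sequence
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)  # residual self-attention
        return tokens.transpose(1, 2).reshape(b, c, t, j, e)


# Toy usage: 2 entities, 25 joints, 16 frames, 64-dim joint features.
x = torch.randn(1, 64, 16, 25, 2)
print(ISTBlock()(x).shape)                     # torch.Size([1, 64, 16, 25, 2])
```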
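
The Interaction Region Visual Transformer entry above builds interaction tokens with Spatial Cross-Attention between hands and objects and refines them with Trajectory Cross-Attention over context. The sketch below approximates both steps with standard multi-head cross-attention; the token sources, shapes, and names are assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed shapes and names) of hand-object interaction
# modeling via two cross-attention stages, as described for egocentric
# action anticipation.
import torch
import torch.nn as nn


class InteractionTokens(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Spatial cross-attention: hand tokens query object tokens.
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Trajectory cross-attention: refine with temporal/context tokens.
        self.trajectory = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hands, objects, context):
        inter, _ = self.spatial(hands, objects, objects)        # hand-object interaction tokens
        refined, _ = self.trajectory(inter, context, context)   # environment-refined tokens
        return refined


hands = torch.randn(1, 2, 256)     # e.g. left/right hand tokens
objects = torch.randn(1, 5, 256)   # detected object tokens
context = torch.randn(1, 8, 256)   # temporal/context tokens
print(InteractionTokens()(hands, objects, context).shape)  # torch.Size([1, 2, 256])
```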
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the summaries (or of any other information) and is not responsible for any consequences of their use.