Interactive Spatiotemporal Token Attention Network for Skeleton-based
General Interactive Action Recognition
- URL: http://arxiv.org/abs/2307.07469v1
- Date: Fri, 14 Jul 2023 16:51:25 GMT
- Title: Interactive Spatiotemporal Token Attention Network for Skeleton-based
General Interactive Action Recognition
- Authors: Yuhang Wen, Zixuan Tang, Yunsheng Pang, Beichen Ding, Mengyuan Liu
- Abstract summary: We propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations.
Our network contains a tokenizer to partition Interactive Spatiotemporal Tokens (ISTs), which is a unified way to represent motions of multiple diverse entities.
To jointly learn along three dimensions in ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations.
- Score: 8.513434732050749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing interactive actions plays an important role in human-robot
interaction and collaboration. Previous methods use late fusion and
co-attention mechanisms to capture interactive relations, which have limited
learning capability or adapt inefficiently to more interacting entities. With
the assumption that priors of each entity are already known, they also lack
evaluations on a more general setting addressing the diversity of subjects. To
address these problems, we propose an Interactive Spatiotemporal Token
Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and
interactive relations. Specifically, our network contains a tokenizer to
partition Interactive Spatiotemporal Tokens (ISTs), which is a unified way to
represent motions of multiple diverse entities. By extending the entity
dimension, ISTs provide better interactive representations. To jointly learn
along three dimensions in ISTs, multi-head self-attention blocks integrated
with 3D convolutions are designed to capture inter-token correlations. When
modeling correlations, a strict entity ordering is usually irrelevant for
recognizing interactive actions. To this end, Entity Rearrangement is proposed
to eliminate the orderliness in ISTs for interchangeable entities. Extensive
experiments on four datasets verify the effectiveness of ISTA-Net by
outperforming state-of-the-art methods. Our code is publicly available at
https://github.com/Necolizer/ISTA-Net
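
The abstract names two operations, partitioning skeleton sequences into tokens that span an entity dimension and Entity Rearrangement for interchangeable entities, without giving details here. The following is a minimal NumPy sketch of the general idea, not the authors' implementation. The tensor layout (channels, frames, joints, entities), the non-overlapping temporal window, and the function names `tokenize` and `entity_rearrangement` are all assumptions made for illustration.

```python
import numpy as np


def entity_rearrangement(x, rng):
    """Shuffle the entity axis so the model cannot rely on entity order.

    x has the assumed layout (C, T, J, E): channels, frames, joints, entities.
    """
    perm = rng.permutation(x.shape[-1])
    return x[..., perm]


def tokenize(x, window):
    """Group frames into non-overlapping temporal windows (an assumed scheme).

    Each token covers one joint over `window` frames and all entities,
    flattened into a vector of length window * E * C.
    """
    c, t, j, e = x.shape
    if t % window != 0:
        raise ValueError("number of frames must be divisible by window")
    n = t // window
    # (C, T, J, E) -> (C, n, window, J, E) -> (n, J, window, E, C)
    x = x.reshape(c, n, window, j, e).transpose(1, 3, 2, 4, 0)
    return x.reshape(n * j, window * e * c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical clip: 3 coordinates, 8 frames, 5 joints, 2 entities.
    clip = rng.standard_normal((3, 8, 5, 2))
    shuffled = entity_rearrangement(clip, rng)
    tokens = tokenize(shuffled, window=4)
    print(tokens.shape)  # (10, 24)
```

Because each token already contains every entity, shuffling the entity axis before tokenization changes only the order of features inside a token, which is one way to discourage a model from depending on a fixed entity ordering.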
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Interaction Event Forecasting in Multi-Relational Recursive HyperGraphs: A Temporal Point Process Approach [12.142292322071299]
This work addresses the problem of forecasting higher-order interaction events in multi-relational recursive hypergraphs.
The proposed model, Relational Recursive Hyperedge Temporal Point Process (RRHyperTPP), uses an encoder that learns a dynamic node representation based on the historical interaction patterns.
We have experimentally shown that our models perform better than previous state-of-the-art methods for interaction forecasting.
arXiv Detail & Related papers (2024-04-27T15:46:54Z)
- Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection [30.896749712316222]
This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as a spatio-temporal graph with human and object nodes as input.
We achieve state-of-the-art performance on the CAD-120 and Something-Else datasets.
arXiv Detail & Related papers (2022-06-07T07:26:06Z)
- Dynamic Relation Discovery and Utilization in Multi-Entity Time Series Forecasting [92.32415130188046]
In many real-world scenarios, there can exist crucial yet implicit relations between entities.
We propose an attentional multi-graph neural network with automatic graph learning (A2GNN) in this work.
arXiv Detail & Related papers (2022-02-18T11:37:04Z)
- Multi-Relation Aware Temporal Interaction Network Embedding [6.964492092209715]
Temporal interaction network embedding can effectively mine the information in temporal interaction networks.
Existing temporal interaction network embedding methods only use historical interaction relations to mine neighbor nodes.
We propose a multi-relation aware temporal interaction network embedding method (MRATE).
arXiv Detail & Related papers (2021-10-09T08:28:22Z)
- Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Full use of appearance features, spatial location, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- DCR-Net: A Deep Co-Interactive Relation Network for Joint Dialog Act Recognition and Sentiment Classification [77.59549450705384]
In dialog systems, dialog act recognition and sentiment classification are two correlative tasks.
Most of the existing systems either treat them as separate tasks or just jointly model the two tasks.
We propose a Deep Co-Interactive Relation Network (DCR-Net) to explicitly consider the cross-impact and model the interaction between the two tasks.
arXiv Detail & Related papers (2020-08-16T14:13:32Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)