Human-to-Human Interaction Detection
- URL: http://arxiv.org/abs/2307.00464v2
- Date: Fri, 11 Aug 2023 10:08:46 GMT
- Title: Human-to-Human Interaction Detection
- Authors: Zhenhua Wang, Kaining Ying, Jiajun Meng, Jifeng Ning
- Abstract summary: We introduce a new task named human-to-human interaction detection (HID).
HID aims to detect subjects, recognize person-wise actions, and group people according to their interactive relations within a single model.
First, based on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I).
- Score: 3.00604614803979
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A comprehensive understanding of human-to-human interactions of interest in video streams, such as queuing, handshaking, fighting and chasing, is of immense importance to public-security surveillance in areas such as campuses, squares and parks. Unlike conventional human interaction recognition, which uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages, we introduce a new task named human-to-human interaction detection (HID). HID aims to detect subjects, recognize person-wise actions, and group people according to their interactive relations within a single model. First, based on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I), by adding frame-by-frame annotations of interactive relations. AVA-I consists of 85,254 frames and 86,338 interactive groups, and each image includes up to 4 concurrent interactive groups. Second, we present SaMFormer, a novel baseline approach for HID that comprises a visual feature extractor, a split stage that leverages a Transformer-based model to decode action instances and interactive groups, and a merging stage that reconstructs the relationship between instances and groups. All SaMFormer components are jointly trained in an end-to-end manner. Extensive experiments on AVA-I validate the superiority of SaMFormer over representative methods. The dataset and code will be made public to encourage follow-up studies.
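As a rough illustration of the split-and-merge design described above, the sketch below pairs a Transformer decoder with two query sets: instance queries for person boxes and per-person actions, and group queries for interactive groups, followed by a similarity-based merging step. All module names, query semantics, and head shapes here are assumptions inferred from the abstract, not the authors' released code.

```python
# Hypothetical sketch of SaMFormer's split-and-merge design, inferred only
# from the abstract above; not the authors' published implementation.
import torch
import torch.nn as nn


class SaMFormerSketch(nn.Module):
    def __init__(self, d_model=256, num_instance_queries=100,
                 num_group_queries=16, num_actions=80):
        super().__init__()
        # Visual feature extractor (stand-in: a patchify conv; any backbone works).
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Split stage: one Transformer decodes both query sets in parallel.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.instance_queries = nn.Parameter(torch.randn(num_instance_queries, d_model))
        self.group_queries = nn.Parameter(torch.randn(num_group_queries, d_model))
        # Heads: person boxes, per-person action logits, and embeddings
        # used by the merging stage to associate instances with groups.
        self.box_head = nn.Linear(d_model, 4)
        self.action_head = nn.Linear(d_model, num_actions)
        self.embed_head = nn.Linear(d_model, d_model)

    def forward(self, images):
        b = images.size(0)
        memory = self.backbone(images).flatten(2).transpose(1, 2)  # (B, HW, C)
        queries = torch.cat([self.instance_queries, self.group_queries], dim=0)
        hs = self.decoder(queries.unsqueeze(0).expand(b, -1, -1), memory)
        n = self.instance_queries.size(0)
        inst, grp = hs[:, :n], hs[:, n:]
        # Merging stage: a similarity matrix assigns each detected person to
        # an interactive group (one plausible reading of "reconstructs the
        # relationship between instances and groups").
        assign = torch.einsum('bic,bgc->big',
                              self.embed_head(inst), self.embed_head(grp))
        return self.box_head(inst), self.action_head(inst), assign.softmax(-1)
```

A softmax over instance-group similarities is just one plausible realization of the merging stage; the paper itself should be consulted for the actual formulation.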
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition [22.538114033191313]
We propose a mutual excitation graph convolutional network (me-GCN) by stacking mutual excitation graph convolution layers (a toy sketch of the idea follows this entry).
Me-GC learns mutual information in each layer and each stage of graph convolution operations.
Our proposed me-GC outperforms state-of-the-art GCN-based and Transformer-based methods.
arXiv Detail & Related papers (2024-02-04T10:00:00Z)
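Reading "mutual excitation" as each entity's features gating the other's, a minimal toy layer might look like the following. The layer structure, gating mechanism, and names are my guesses rather than the published me-GCN.

```python
# Toy illustration of a "mutual excitation" graph convolution: two entity
# streams (e.g., two hands or two persons) run graph convolutions, and each
# stream is re-weighted by channel gates computed from the other stream.
# This is an assumed reading of the summary, not the authors' me-GCN.
import torch
import torch.nn as nn


class MutualExcitationGC(nn.Module):
    def __init__(self, in_ch, out_ch, num_joints):
        super().__init__()
        self.gcn_a = nn.Linear(in_ch, out_ch)   # stream A graph convolution
        self.gcn_b = nn.Linear(in_ch, out_ch)   # stream B graph convolution
        # Squeeze-and-excitation style gates driven by the *other* stream.
        self.gate_a = nn.Sequential(nn.Linear(out_ch, out_ch), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Linear(out_ch, out_ch), nn.Sigmoid())
        self.adj = nn.Parameter(torch.eye(num_joints))  # learnable adjacency

    def forward(self, xa, xb):  # each: (B, J, C) joint features per entity
        ha = self.gcn_a(torch.einsum('jk,bkc->bjc', self.adj, xa))
        hb = self.gcn_b(torch.einsum('jk,bkc->bjc', self.adj, xb))
        # Mutual excitation: pool one stream, gate the other's channels
        # (applied sequentially here purely for simplicity of the toy).
        ha = ha * self.gate_a(hb.mean(dim=1, keepdim=True))
        hb = hb * self.gate_b(ha.mean(dim=1, keepdim=True))
        return torch.relu(ha), torch.relu(hb)
```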
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition [45.0131792009999]
We propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition.
Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information.
Our network outperforms state-of-the-art approaches in most standard evaluation settings.
arXiv Detail & Related papers (2023-07-22T03:51:32Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Joint Engagement Classification using Video Augmentation Techniques for Multi-person Human-robot Interaction [22.73774398716566]
We present a novel framework for identifying a parent-child dyad's joint engagement.
Using a dataset of parent-child dyads reading storybooks together with a social robot at home, we first train RGB frame- and skeleton-based joint engagement recognition models.
Second, we demonstrate experimental results on the use of trained models in the robot-parent-child interaction context.
arXiv Detail & Related papers (2022-12-28T23:52:55Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction (a toy decoding sketch follows this entry).
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
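The interaction-point idea above can be illustrated with a toy decoding step: treat the midpoint of a human-object pair as its interaction point and score the pair by the heatmap response there. The function below is illustrative only; names, shapes, and the exact pairing rule are assumptions.

```python
# Minimal sketch of decoding with interaction points: score each
# (human, object) pair by the predicted heatmap value at the pair's
# midpoint. Illustrative only; not the paper's exact pipeline.
import numpy as np


def score_pairs(human_centers, object_centers, heatmap, cls):
    """human_centers: (N, 2) and object_centers: (M, 2) arrays of (x, y)
    in heatmap coordinates; heatmap: (num_classes, H, W) of predicted
    interaction-point responses. Returns an (N, M) score matrix."""
    scores = np.zeros((len(human_centers), len(object_centers)))
    for i, h in enumerate(human_centers):
        for j, o in enumerate(object_centers):
            mx, my = np.round((h + o) / 2).astype(int)  # interaction point
            my = np.clip(my, 0, heatmap.shape[1] - 1)
            mx = np.clip(mx, 0, heatmap.shape[2] - 1)
            scores[i, j] = heatmap[cls, my, mx]
    return scores
```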
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.