Cascaded Human-Object Interaction Recognition
- URL: http://arxiv.org/abs/2003.04262v2
- Date: Wed, 11 Mar 2020 10:54:41 GMT
- Title: Cascaded Human-Object Interaction Recognition
- Authors: Tianfei Zhou, Wenguan Wang, Siyuan Qi, Haibin Ling, Jianbing Shen
- Abstract summary: We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
- Score: 175.60439054047043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rapid progress has been witnessed for human-object interaction (HOI)
recognition, but most existing models are confined to single-stage reasoning
pipelines. Considering the intrinsic complexity of the task, we introduce a
cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At
each stage, an instance localization network progressively refines HOI
proposals and feeds them into an interaction recognition network. Each of the
two networks is also connected to its predecessor at the previous stage,
enabling cross-stage information propagation. The interaction recognition
network has two crucial parts: a relation ranking module for high-quality HOI
proposal selection and a triple-stream classifier for relation prediction. With
our carefully-designed human-centric relation features, these two modules work
collaboratively towards effective interaction understanding. Further beyond
relation detection on a bounding-box level, we make our framework flexible to
perform fine-grained pixel-wise relation segmentation; this provides a new
glimpse into better relation modeling. Our approach reached the $1^{st}$ place
in the ICCV2019 Person in Context Challenge, on both relation detection and
segmentation tasks. It also shows promising results on V-COCO.
Related papers
- Towards a Unified Transformer-based Framework for Scene Graph Generation
and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z) - Interactive Spatiotemporal Token Attention Network for Skeleton-based
General Interactive Action Recognition [8.513434732050749]
We propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously model spatial, temporal, and interactive relations.
Our network contains a tokenizer to partition Interactive Spatiotemporal Tokens (ISTs), which is a unified way to represent motions of multiple diverse entities.
To jointly learn along three dimensions in ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations.
arXiv Detail & Related papers (2023-07-14T16:51:25Z) - A Hierarchical Interactive Network for Joint Span-based Aspect-Sentiment
Analysis [34.1489054082536]
We propose a hierarchical interactive network (HI-ASA) to model two-way interactions between two tasks appropriately.
We use cross-stitch mechanism to combine the different task-specific features selectively as the input to ensure proper two-way interactions.
Experiments on three real-world datasets demonstrate HI-ASA's superiority over baselines.
arXiv Detail & Related papers (2022-08-24T03:03:49Z) - RR-Net: Injecting Interactive Semantics in Human-Object Interaction
Detection [40.65483058890176]
Latest end-to-end HOI detectors are short of relation reasoning, which leads to inability to learn HOI-specific interactive semantics for predictions.
We first present a progressive Relation-aware Frame, which brings a new structure and parameter sharing pattern for interaction inference.
Based on modules above, we construct an end-to-end trainable framework named Relation Reasoning Network (abbr. RR-Net)
arXiv Detail & Related papers (2021-04-30T14:03:10Z) - Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to under-stand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z) - CoADNet: Collaborative Aggregation-and-Distribution Networks for
Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z) - A Co-Interactive Transformer for Joint Slot Filling and Intent Detection [61.109486326954205]
Intent detection and slot filling are two main tasks for building a spoken language understanding (SLU) system.
Previous studies either model the two tasks separately or only consider the single information flow from intent to slot.
We propose a Co-Interactive Transformer to consider the cross-impact between the two tasks simultaneously.
arXiv Detail & Related papers (2020-10-08T10:16:52Z) - DCR-Net: A Deep Co-Interactive Relation Network for Joint Dialog Act
Recognition and Sentiment Classification [77.59549450705384]
In dialog system, dialog act recognition and sentiment classification are two correlative tasks.
Most of the existing systems either treat them as separate tasks or just jointly model the two tasks.
We propose a Deep Co-Interactive Relation Network (DCR-Net) to explicitly consider the cross-impact and model the interaction between the two tasks.
arXiv Detail & Related papers (2020-08-16T14:13:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.