What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions
- URL: http://arxiv.org/abs/2204.00746v1
- Date: Sat, 2 Apr 2022 02:41:31 GMT
- Title: What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions
- Authors: A S M Iftekhar, Hao Chen, Kaustav Kundu, Xinyu Li, Joseph Tighe,
Davide Modolo
- Abstract summary: We propose a one-stage Transformer-based semantic and spatial refined transformer (SSRT) to solve the Human-Object Interaction detection task.
Two new modules help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features.
These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.
- Score: 26.87434934565539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel one-stage Transformer-based semantic and spatial refined transformer (SSRT) to solve the Human-Object Interaction detection task, which requires localizing humans and objects and predicting their interactions. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules to help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.
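The paper's code is not reproduced here, but a minimal PyTorch sketch of the central idea (enriching decoder queries with semantic and spatial cues) might look as follows. `QueryRefiner` and all dimensions are hypothetical, and the fusion is reduced to a simple concatenate-and-project step rather than SSRT's actual modules.

```python
import torch
import torch.nn as nn

class QueryRefiner(nn.Module):
    """Hypothetical sketch of SSRT-style query refinement: decoder queries
    are enriched with semantic cues (e.g. label embeddings of candidate
    object-action pairs) and spatial cues (e.g. encoded box geometry)."""

    def __init__(self, d_model: int = 256, d_sem: int = 300, d_spa: int = 64):
        super().__init__()
        self.sem_proj = nn.Linear(d_sem, d_model)   # project semantic cues
        self.spa_proj = nn.Linear(d_spa, d_model)   # project spatial cues
        self.fuse = nn.Sequential(                  # fuse query + both cues
            nn.Linear(3 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, queries, sem_feats, spa_feats):
        # queries:   (num_queries, d_model) decoder queries
        # sem_feats: (num_queries, d_sem)   semantic cues per query
        # spa_feats: (num_queries, d_spa)   spatial cues per query
        fused = torch.cat(
            [queries, self.sem_proj(sem_feats), self.spa_proj(spa_feats)],
            dim=-1,
        )
        return queries + self.fuse(fused)  # residual refinement

refiner = QueryRefiner()
q = torch.randn(100, 256)    # 100 decoder queries
sem = torch.randn(100, 300)  # e.g. word embeddings of selected pairs
spa = torch.randn(100, 64)   # e.g. encoded box coordinates
print(refiner(q, sem, spa).shape)  # torch.Size([100, 256])
```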
Related papers
- Geometric Features Enhanced Human-Object Interaction Detection [11.513009304308724]
We propose a novel end-to-end Transformer-style HOI detection model, i.e., the geometric features enhanced HOI detector (GeoHOI).
One key part of the model is a new unified self-supervised keypoint learning method named UniPointNet.
GeoHOI effectively upgrades a Transformer-based HOI detector by exploiting keypoint similarities that measure the likelihood of human-object interactions, as illustrated below.
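The summary does not specify how the keypoint similarities are computed; as a loose illustration only (not GeoHOI's actual measure), one could pool per-keypoint descriptors and compare them with cosine similarity:

```python
import torch
import torch.nn.functional as F

def keypoint_interaction_score(human_kpts: torch.Tensor,
                               object_kpts: torch.Tensor) -> torch.Tensor:
    """Illustrative guess at a keypoint-based interaction likelihood:
    pool keypoint descriptors and compare with cosine similarity, mapped
    to [0, 1]. GeoHOI's actual measure may differ.

    human_kpts:  (K_h, D) descriptors of human keypoints
    object_kpts: (K_o, D) descriptors of object keypoints
    """
    h = human_kpts.mean(dim=0)   # pooled human descriptor
    o = object_kpts.mean(dim=0)  # pooled object descriptor
    return (F.cosine_similarity(h, o, dim=0) + 1) / 2

score = keypoint_interaction_score(torch.randn(17, 128), torch.randn(9, 128))
print(float(score))  # value in [0, 1]
```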
arXiv Detail & Related papers (2024-06-26T18:52:53Z)
- Neural-Logic Human-Object Interaction Detection [67.4993347702353]
We present LogicHOI, a new HOI detector that leverages neural-logic reasoning and Transformers to infer feasible interactions between entities.
Specifically, it modifies the self-attention mechanism of the vanilla Transformer, enabling it to reason over ⟨human, action, object⟩ triplets and constitute novel interactions; a reduced sketch of such triplet attention follows.
The approach further formulates two interaction properties in first-order logic and grounds them into continuous space to constrain the learning process, leading to improved performance and zero-shot generalization capabilities.
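As a loose stand-in for the modified self-attention (LogicHOI's actual formulation and its logic constraints are not reproduced here), one can restrict attention to the three triplet tokens:

```python
import torch
import torch.nn as nn

class TripletAttention(nn.Module):
    """Loose sketch of attention restricted to a <human, action, object>
    triplet, standing in for LogicHOI's modified self-attention; the real
    model's formulation is more involved."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, human, action, obj):
        # Each input: (batch, d_model). Stack into a 3-token sequence so
        # human, action and object representations attend to one another.
        triplet = torch.stack([human, action, obj], dim=1)  # (batch, 3, d)
        out, _ = self.attn(triplet, triplet, triplet)
        return out.unbind(dim=1)  # refined (human, action, obj)

ta = TripletAttention()
h, a, o = (torch.randn(4, 256) for _ in range(3))
h2, a2, o2 = ta(h, a, o)
print(h2.shape)  # torch.Size([4, 256])
```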
arXiv Detail & Related papers (2023-11-16T11:47:53Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection [119.93025368028083]
We design a novel Transformer-style Human-Object Interaction (HOI) detector, i.e., Structure-aware Transformer over Interaction Proposals (STIP).
STIP decomposes HOI set prediction into two sequential phases: interaction proposals are first generated, and these non-parametric proposals are then transformed into HOI predictions via a structure-aware Transformer (see the schematic below).
The structure-aware Transformer upgrades the vanilla Transformer by additionally encoding the holistic semantic structure among interaction proposals as well as the local spatial structure of the human/object within each proposal, so as to strengthen HOI predictions.
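A schematic of the two-phase decomposition, with hypothetical names (`pair_scorer`, `refiner`) and without STIP's structure-aware attention biases:

```python
import torch
import torch.nn as nn

class TwoPhaseHOI(nn.Module):
    """Schematic of STIP-style two-phase prediction: score human-object
    pairs as interaction proposals, then refine the kept proposals with a
    Transformer encoder. The structure-aware biases are omitted."""

    def __init__(self, d_model: int = 256, num_actions: int = 117, top_k: int = 32):
        super().__init__()
        self.pair_scorer = nn.Linear(2 * d_model, 1)  # phase 1: proposal score
        self.refiner = nn.TransformerEncoder(          # phase 2: refinement
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.pair_proj = nn.Linear(2 * d_model, d_model)
        self.action_head = nn.Linear(d_model, num_actions)
        self.top_k = top_k

    def forward(self, humans, objects):
        # humans: (H, d), objects: (O, d) instance features from a detector
        pairs = torch.cat(
            [humans.unsqueeze(1).expand(-1, objects.size(0), -1),
             objects.unsqueeze(0).expand(humans.size(0), -1, -1)], dim=-1
        ).flatten(0, 1)                                   # (H*O, 2d)
        scores = self.pair_scorer(pairs).squeeze(-1)      # (H*O,)
        keep = scores.topk(min(self.top_k, scores.numel())).indices
        props = self.pair_proj(pairs[keep]).unsqueeze(0)  # (1, K, d)
        return self.action_head(self.refiner(props))      # (1, K, num_actions)

model = TwoPhaseHOI()
logits = model(torch.randn(3, 256), torch.randn(5, 256))
print(logits.shape)  # torch.Size([1, 15, 117])
```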
arXiv Detail & Related papers (2022-06-13T16:21:08Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection [21.296007737406494]
Human-Object Interaction (HOI) detection is the task of identifying a set of ⟨human, object, interaction⟩ triplets from an image.
Recent work proposed transformer encoder-decoder architectures that successfully eliminated the need for many hand-designed components in HOI detection.
We propose a Multi-Scale TRansformer (MSTR) for HOI detection powered by two novel HOI-aware deformable attention modules; a much-reduced sketch of deformable sampling follows.
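The summary gives no details of the HOI-aware modules themselves; the following is only a generic, single-query sketch of deformable sampling (predicted offsets plus bilinear sampling with learned weights), the mechanism such modules build on:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableSampling(nn.Module):
    """Very reduced sketch of deformable attention as used in DETR-style
    detectors: each query predicts a few sampling offsets around a
    reference point and aggregates bilinearly sampled features. MSTR's
    HOI-aware variants add interaction-specific sampling on top of this."""

    def __init__(self, d_model: int = 256, n_points: int = 4):
        super().__init__()
        self.offsets = nn.Linear(d_model, 2 * n_points)  # (dx, dy) per point
        self.weights = nn.Linear(d_model, n_points)      # attention weights
        self.n_points = n_points

    def forward(self, query, feat_map, ref_point):
        # query: (d,), feat_map: (d, H, W), ref_point: (2,) in [-1, 1]
        off = self.offsets(query).view(self.n_points, 2).tanh() * 0.1
        locs = (ref_point + off).view(1, self.n_points, 1, 2)  # grid coords
        sampled = F.grid_sample(feat_map.unsqueeze(0), locs,
                                align_corners=False)            # (1, d, P, 1)
        w = self.weights(query).softmax(dim=-1)                 # (P,)
        return (sampled.squeeze(0).squeeze(-1) * w).sum(dim=-1) # (d,)

m = SimpleDeformableSampling()
out = m(torch.randn(256), torch.randn(256, 32, 32), torch.tensor([0.0, 0.0]))
print(out.shape)  # torch.Size([256])
```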
arXiv Detail & Related papers (2022-03-28T12:58:59Z)
- Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z)
- Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer [41.44769642537572]
Unary-Pairwise Transformer is a two-stage detector that exploits unary and pairwise representations for HOIs.
We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches; a rough sketch of the unary-pairwise idea follows.
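A rough sketch of a two-stage, unary-pairwise design under stated assumptions (the real model's dedicated cooperative/competitive layers are not reproduced here):

```python
import torch
import torch.nn as nn

class UnaryPairwiseHead(nn.Module):
    """Rough sketch of a two-stage, unary-pairwise design: instance
    (unary) features from a frozen detector are refined by self-attention,
    then each human-object pair is scored from the concatenated pairwise
    representation. All names and layer choices are assumptions."""

    def __init__(self, d_model: int = 256, num_actions: int = 117):
        super().__init__()
        self.unary = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=1,
        )
        self.pairwise = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, num_actions),
        )

    def forward(self, inst_feats, human_idx, object_idx):
        # inst_feats: (1, N, d) detections; human_idx/object_idx: (P,) pairs
        x = self.unary(inst_feats).squeeze(0)          # refined unary tokens
        pairs = torch.cat([x[human_idx], x[object_idx]], dim=-1)
        return self.pairwise(pairs)                    # (P, num_actions)

head = UnaryPairwiseHead()
feats = torch.randn(1, 6, 256)
logits = head(feats, torch.tensor([0, 0]), torch.tensor([1, 2]))
print(logits.shape)  # torch.Size([2, 117])
```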
arXiv Detail & Related papers (2021-12-03T10:52:06Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.