FGAHOI: Fine-Grained Anchors for Human-Object Interaction Detection
- URL: http://arxiv.org/abs/2301.04019v1
- Date: Sun, 8 Jan 2023 03:53:50 GMT
- Title: FGAHOI: Fine-Grained Anchors for Human-Object Interaction Detection
- Authors: Shuailei Ma, Yuefeng Wang, Shanze Wang and Ying Wei
- Abstract summary: A novel end-to-end transformer-based framework (FGAHOI) is proposed to alleviate the above problems.
FGAHOI comprises three dedicated components, namely multi-scale sampling (MSS), hierarchical spatial-aware merging (HSAM), and a task-aware merging mechanism (TAM).
- Score: 4.534713782093219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction (HOI), as an important problem in computer vision,
requires locating the human-object pair and identifying the interactive
relationships between them. An HOI instance spans a greater range in space,
scale, and task than an individual object instance, making its detection more
susceptible to noisy backgrounds. To alleviate the disturbance of noisy
backgrounds on HOI detection, it is necessary to consider the input image
information to generate fine-grained anchors which are then leveraged to guide
the detection of HOI instances. However, this is challenging for two
reasons: i) how to extract pivotal features from images with complex
background information remains an open question; ii) how to semantically align
the extracted features with the query embeddings is also difficult. In this
paper, a novel end-to-end transformer-based framework (FGAHOI) is proposed to
alleviate the above problems. FGAHOI comprises three dedicated components,
namely multi-scale sampling (MSS), hierarchical spatial-aware merging (HSAM),
and a task-aware merging mechanism (TAM). MSS extracts features of humans,
objects and interaction areas from noisy backgrounds for HOI instances of
various scales. HSAM and TAM semantically align and merge the extracted
features and query embeddings in the hierarchical spatial and task perspectives
in turn. Meanwhile, a novel Stage-wise Training Strategy is designed to
reduce the training burden caused by the overly complex task FGAHOI performs.
In addition, we propose two ways to measure the difficulty of HOI detection
and a novel dataset, HOI-SDC, targeting two challenges of HOI instance
detection: unevenly distributed areas in human-object pairs and long-distance
visual modeling of human-object pairs.
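The abstract describes a three-stage pipeline: MSS samples features around fine-grained anchors at multiple scales, HSAM merges them spatially and hierarchically, and TAM fuses the result into the query embeddings. The toy Python below only illustrates that data flow; every function body, shape, and fusion rule is an assumption made for illustration (plain lists, nearest-pixel sampling, and mean-pooling as stand-ins), not the authors' implementation.

```python
import random

# Hedged sketch of the MSS -> HSAM -> TAM data flow described in the abstract.
# All names, shapes, and operations here are illustrative assumptions.

random.seed(0)

def make_feature_map(size, dim):
    """A toy (size x size) feature map with dim-dimensional features."""
    return [[[random.random() for _ in range(dim)]
             for _ in range(size)] for _ in range(size)]

def multi_scale_sampling(feature_maps, anchors):
    """MSS (assumed): sample features around each anchor at every scale."""
    sampled = []
    for fmap in feature_maps:                     # one map per scale
        h, w = len(fmap), len(fmap[0])
        per_anchor = []
        for ax, ay in anchors:                    # normalized (x, y) anchors
            px, py = int(ax * (w - 1)), int(ay * (h - 1))
            pts = [fmap[min(py + dy, h - 1)][min(px + dx, w - 1)]
                   for dy in (0, 1) for dx in (0, 1)]
            per_anchor.append(pts)                # 4 sampled points per anchor
        sampled.append(per_anchor)
    return sampled

def mean_vecs(vecs):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def hierarchical_spatial_aware_merging(sampled):
    """HSAM (assumed): merge points within each scale, then across scales."""
    per_scale = [[mean_vecs(pts) for pts in scale] for scale in sampled]
    n_anchors = len(per_scale[0])
    return [mean_vecs([scale[i] for scale in per_scale])
            for i in range(n_anchors)]

def task_aware_merging(merged, queries):
    """TAM (assumed): fuse merged features into query embeddings (toy residual)."""
    return [[q + m for q, m in zip(qv, mv)] for qv, mv in zip(queries, merged)]

feats = [make_feature_map(s, 8) for s in (32, 16, 8)]   # three feature scales
anchors = [(0.25, 0.5), (0.7, 0.3)]                     # two fine-grained anchors
queries = [[0.0] * 8 for _ in anchors]                  # one query per anchor

refined = task_aware_merging(
    hierarchical_spatial_aware_merging(multi_scale_sampling(feats, anchors)),
    queries,
)
print(len(refined), len(refined[0]))   # 2 8: one refined embedding per anchor
```

In the real model these stages would be learned modules inside a transformer decoder; the sketch only shows the composition: per-scale sampling around anchors, merging within and then across scales, and finally fusing into the query embeddings.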
Related papers
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state-of-the-art on the Argoverse 2 Sensor and Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z) - Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs [5.891295920078768]
We introduce an advanced approach for fine-grained object visual key field detection.
First, we use the segment anything model (SAM) to generate detailed spatial maps of objects in images.
Next, we use Vision Studio to extract semantic object descriptions.
Third, we employ GPT-4's common sense knowledge, bridging the gap between an object's semantics and its spatial map.
arXiv Detail & Related papers (2024-04-01T14:53:36Z) - Small Object Detection via Coarse-to-fine Proposal Generation and
Imitation Learning [52.06176253457522]
We propose a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning.
CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A.
arXiv Detail & Related papers (2023-08-18T13:13:09Z) - PS-ARM: An End-to-End Attention-aware Relation Mixer Network for Person
Search [56.02761592710612]
We propose a novel attention-aware relation mixer (ARM) module for person search.
Our ARM module is native and does not rely on fine-grained supervision or topological assumptions.
Our PS-ARM achieves state-of-the-art performance on both datasets.
arXiv Detail & Related papers (2022-10-07T10:04:12Z) - Self-Supervised Interactive Object Segmentation Through a
Singulation-and-Grasping Approach [9.029861710944704]
We propose a robot learning approach to interact with novel objects and collect each object's training label.
The Singulation-and-Grasping (SaG) policy is trained through end-to-end reinforcement learning.
Our system achieves 70% singulation success rate in simulated cluttered scenes.
arXiv Detail & Related papers (2022-07-19T15:01:36Z) - MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction
Detection [21.296007737406494]
Human-Object Interaction (HOI) detection is the task of identifying a set of ⟨human, object, interaction⟩ triplets from an image.
Recent work proposed transformer encoder-decoder architectures that successfully eliminated the need for many hand-designed components in HOI detection.
We propose a Multi-Scale TRansformer (MSTR) for HOI detection powered by two novel HOI-aware deformable attention modules.
arXiv Detail & Related papers (2022-03-28T12:58:59Z) - QAHOI: Query-Based Anchors for Human-Object Interaction Detection [29.548384966666013]
One-stage approaches have become a new trend for this task due to their high efficiency.
We propose a transformer-based method, QAHOI, which uses query-based anchors to predict all the elements of an HOI instance.
We find that a powerful backbone significantly improves accuracy for QAHOI, and QAHOI with a transformer-based backbone outperforms recent state-of-the-art methods by large margins on the HICO-DET benchmark.
arXiv Detail & Related papers (2021-12-16T05:52:23Z) - Batch Exploration with Examples for Scalable Robotic Reinforcement
Learning [63.552788688544254]
Batch Exploration with Examples (BEE) explores relevant regions of the state-space guided by a modest number of human provided images of important states.
BEE is able to tackle challenging vision-based manipulation tasks both in simulation and on a real Franka robot.
arXiv Detail & Related papers (2020-10-22T17:49:25Z) - Tasks Integrated Networks: Joint Detection and Retrieval for Image
Search [99.49021025124405]
In many real-world searching scenarios (e.g., video surveillance), the objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
arXiv Detail & Related papers (2020-09-03T03:57:50Z) - DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.