Joint Gaze-Location and Gaze-Object Detection
- URL: http://arxiv.org/abs/2308.13857v1
- Date: Sat, 26 Aug 2023 12:12:24 GMT
- Title: Joint Gaze-Location and Gaze-Object Detection
- Authors: Danyang Tu, Wei Shen, Wei Sun, Xiongkuo Min, Guangtao Zhai
- Abstract summary: Current approaches frame gaze location detection (GL-D) and gaze object detection (GO-D) as two separate tasks.
We propose GTR, short for Gaze following detection TRansformer, to streamline the gaze following detection pipeline.
GTR achieves a 12.1 mAP gain on GazeFollowing and an 18.2 mAP gain on VideoAttentionTarget for GL-D, as well as a 19 mAP improvement on GOO-Real for GO-D.
- Score: 62.69261709635086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes an efficient and effective method for joint gaze location
detection (GL-D) and gaze object detection (GO-D), \emph{i.e.}, gaze following
detection. Current approaches frame GL-D and GO-D as two separate tasks,
employing a multi-stage framework where human head crops must first be detected
and then be fed into a subsequent GL-D sub-network, which is further followed
by an additional object detector for GO-D. In contrast, we reframe the gaze
following detection task as detecting human head locations and their gaze
followings simultaneously, aiming to jointly detect human gaze locations and
gaze objects in a unified, single-stage pipeline. To this end, we propose
GTR, short for \underline{G}aze following detection \underline{TR}ansformer,
streamlining the gaze following detection pipeline by eliminating all
additional components, leading to the first unified paradigm that unites GL-D
and GO-D in a fully end-to-end manner. GTR enables an iterative interaction
between holistic semantics and human head features through a hierarchical
structure, inferring the relations of salient objects and human gaze from the
global image context and resulting in an impressive accuracy. Concretely, GTR
achieves a 12.1 mAP gain ($\mathbf{25.1\%}$) on GazeFollowing and an 18.2 mAP
gain ($\mathbf{43.3\%}$) on VideoAttentionTarget for GL-D, as well as a 19 mAP
improvement ($\mathbf{45.2\%}$) on GOO-Real for GO-D. Meanwhile, unlike
existing systems detecting gaze following sequentially due to the need for a
human head as input, GTR has the flexibility to comprehend any number of
people's gaze followings simultaneously, resulting in high efficiency.
Specifically, GTR delivers more than a $9\times$ improvement in FPS, and the
relative gap widens as the number of people grows.
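As a back-of-the-envelope check, the absolute mAP gains and the relative percentages reported above jointly imply the baseline scores GTR is being compared against (an absolute gain of $g$ at a relative improvement of $r$ implies a baseline of $g/r$). The sketch below derives these figures; the implied baselines are computed here, not quoted from the paper.

```python
def implied_baseline(gain: float, relative: float) -> float:
    """Baseline mAP implied by an absolute gain and its relative improvement."""
    return gain / relative

# (absolute mAP gain, relative improvement) pairs from the abstract
reported = {
    "GazeFollowing (GL-D)": (12.1, 0.251),
    "VideoAttentionTarget (GL-D)": (18.2, 0.433),
    "GOO-Real (GO-D)": (19.0, 0.452),
}

for dataset, (gain, relative) in reported.items():
    baseline = implied_baseline(gain, relative)
    print(f"{dataset}: implied baseline = {baseline:.1f} mAP")
```

The three pairs are mutually consistent: the two 2022-era benchmarks (VideoAttentionTarget and GOO-Real) both work out to roughly 42 mAP baselines, while GazeFollowing implies a stronger baseline near 48 mAP.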
Related papers
- Global Confidence Degree Based Graph Neural Network for Financial Fraud Detection [3.730504020733928]
This paper presents the concept and calculation formula of the Global Confidence Degree (GCD) and designs a GCD-based GNN (GCD-GNN).
To obtain a precise GCD for each node, we use a multilayer perceptron to transform features and then the new features and the corresponding prototype are used to eliminate unnecessary information.
Experiments on two public datasets demonstrate that GCD-GNN outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-24T14:55:37Z)
- Exploring Sparsity in Graph Transformers [67.48149404841925]
Graph Transformers (GTs) have achieved impressive results on various graph-related tasks.
However, the huge computational cost of GTs hinders their deployment and application, especially in resource-constrained environments.
We propose a comprehensive Graph Transformer SParsification (GTSP) framework that helps to reduce the computational complexity of GTs.
arXiv Detail & Related papers (2023-12-09T06:21:44Z)
- Object-aware Gaze Target Detection [14.587595325977583]
This paper proposes a Transformer-based architecture that automatically detects objects in the scene to build associations between every head and the gazed-head/object.
Our method achieves state-of-the-art results on all metrics for gaze target detection and 11-13% improvement in average precision for the classification and the localization of the gazed-objects.
arXiv Detail & Related papers (2023-07-18T22:04:41Z)
- MGTR: End-to-End Mutual Gaze Detection with Transformer [1.0312968200748118]
We propose a novel one-stage mutual gaze detection framework called Mutual Gaze TRansformer or MGTR.
By designing mutual gaze instance triples, MGTR can detect each human head bounding box and simultaneously infer mutual gaze relationship based on global image information.
Experimental results on two mutual gaze datasets show that our method is able to accelerate mutual gaze detection process without losing performance.
arXiv Detail & Related papers (2022-09-22T11:26:22Z)
- End-to-End Human-Gaze-Target Detection with Transformers [57.00864538284686]
We propose an effective and efficient method for Human-Gaze-Target (HGT) detection, i.e., gaze following.
Our method, named Human-Gaze-Target detection TRansformer or HGTTR, streamlines the HGT detection pipeline by eliminating all other components.
The effectiveness and robustness of our proposed method are verified with extensive experiments on the two standard benchmark datasets, GazeFollowing and VideoAttentionTarget.
arXiv Detail & Related papers (2022-03-20T02:37:06Z)
- Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection [81.32280287658486]
We propose a novel one-stage method, namely Glance and Gaze Network (GGNet)
GGNet adaptively models a set of actionaware points (ActPoints) via glance and gaze steps.
We design an actionaware approach that effectively matches each detected interaction with its associated human-object pair.
arXiv Detail & Related papers (2021-04-12T08:01:04Z)
- An Adversarial Human Pose Estimation Network Injected with Graph Structure [75.08618278188209]
In this paper, we design a novel generative adversarial network (GAN) to improve the localization accuracy of visible joints when some joints are invisible.
The network consists of two simple but efficient modules, the Cascade Feature Network (CFN) and the Graph Structure Network (GSN).
arXiv Detail & Related papers (2021-03-29T12:07:08Z) - GID-Net: Detecting Human-Object Interaction with Global and Instance
Dependency [67.95192190179975]
We introduce a two-stage trainable reasoning mechanism, referred to as GID block.
GID-Net is a human-object interaction detection framework consisting of a human branch, an object branch and an interaction branch.
We have compared our proposed GID-Net with existing state-of-the-art methods on two public benchmarks, including V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-11T11:58:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.