GaTector: A Unified Framework for Gaze Object Prediction
- URL: http://arxiv.org/abs/2112.03549v3
- Date: Sat, 1 Jul 2023 02:29:33 GMT
- Title: GaTector: A Unified Framework for Gaze Object Prediction
- Authors: Binglu Wang, Tao Hu, Baoshan Li, Xiaojuan Chen, Zhijie Zhang
- Abstract summary: We build a novel framework named GaTector to tackle the gaze object prediction problem in a unified way.
To better consider the specificity of inputs and tasks, GaTector introduces two input-specific blocks before the shared backbone and three task-specific blocks after the shared backbone.
In the end, we propose a novel wUoC metric that can reveal the difference between boxes even when they share no overlapping area.
- Score: 11.456242421204298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gaze object prediction is a newly proposed task that aims to discover the
objects being stared at by humans. It has significant practical value but
still lacks a unified solution framework. An intuitive solution is to
incorporate an object detection branch into an existing gaze prediction method.
However, previous gaze prediction methods usually use two separate networks to
extract features from the scene image and the head image, which leads to a heavy
network architecture and prevents the two branches from being jointly optimized. In this
paper, we build a novel framework named GaTector to tackle the gaze object
prediction problem in a unified way. In particular, a specific-general-specific
(SGS) feature extractor is first proposed, which utilizes a shared backbone to
extract general features for both scene and head images. To better consider the
specificity of inputs and tasks, SGS introduces two input-specific blocks
before the shared backbone and three task-specific blocks after the shared
backbone. Specifically, a novel Defocus layer is designed to generate
object-specific features for the object detection task without losing
information or requiring extra computation. Moreover, an energy aggregation
loss is introduced to guide the gaze heatmap to concentrate on the box of the gazed-at object.
In the end, we propose a novel wUoC metric that can reveal the difference
between boxes even when they share no overlapping area. Extensive experiments
on the GOO dataset verify the superiority of our method in all three tracks,
i.e. object detection, gaze estimation, and gaze object prediction.
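As a rough illustration of the pipeline described above, the following PyTorch-style sketch shows the specific-general-specific layout (input-specific blocks, a shared backbone, task-specific heads) together with an energy-aggregation-style loss. The layer sizes, the PixelShuffle stand-in for the Defocus layer, and the loss normalization are illustrative assumptions, not the released GaTector implementation.
```python
import torch
import torch.nn as nn

class SGSExtractorSketch(nn.Module):
    """Minimal sketch of the specific-general-specific (SGS) idea:
    two input-specific blocks feed one shared backbone, followed by
    task-specific heads. All layer choices are placeholders."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Input-specific blocks: one for the scene image, one for the head crop.
        self.scene_block = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.head_block = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        # Shared backbone extracting general features for both streams.
        self.backbone = nn.Sequential(
            nn.Conv2d(32, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Task-specific heads: gaze-heatmap regression and detection features.
        # PixelShuffle is only a stand-in for the paper's Defocus layer, i.e.
        # recovering spatial resolution without discarding channel information.
        self.gaze_head = nn.Conv2d(2 * feat_dim, 1, kernel_size=1)
        self.defocus = nn.PixelShuffle(upscale_factor=2)
        self.det_head = nn.Conv2d(feat_dim // 4, feat_dim, kernel_size=3, padding=1)

    def forward(self, scene_img, head_img):
        # Each stream passes through its own input-specific block and then the
        # shared backbone (scene and head crops assumed to be the same size here).
        scene_feat = self.backbone(self.scene_block(scene_img))
        head_feat = self.backbone(self.head_block(head_img))
        gaze_heatmap = torch.sigmoid(
            self.gaze_head(torch.cat([scene_feat, head_feat], dim=1)))
        det_feat = self.det_head(self.defocus(scene_feat))
        return gaze_heatmap, det_feat

def energy_aggregation_loss(heatmap, box, eps=1e-6):
    """Encourage the gaze heatmap to concentrate its energy inside the
    annotated gaze-object box. `heatmap` is a non-negative (H, W) tensor,
    `box` is (x1, y1, x2, y2) in heatmap coordinates; the exact
    normalization used in the paper may differ."""
    x1, y1, x2, y2 = box
    inside = heatmap[y1:y2, x1:x2].sum()
    return 1.0 - inside / (heatmap.sum() + eps)
```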
Related papers
- Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model [19.800353299691277]
This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask corresponding to the object captured by human gaze behavior.
We propose to automatically obtain head features from scene features to ensure the model's inference efficiency and flexibility in the real world.
arXiv Detail & Related papers (2024-08-02T06:32:45Z) - Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) needs to cope with high across-frame variation in object appearance and diverse deterioration in some frames.
Most contemporary aggregation methods are tailored for two-stage detectors and suffer from high computational costs.
This study proposes a very simple yet potent strategy of feature selection and aggregation, achieving significant accuracy gains at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - Sharp Eyes: A Salient Object Detector Working The Same Way as Human
Visual Characteristics [3.222802562733787]
We propose a sharp eyes network (SENet) that first separates the object from the scene and then finely segments it.
The proposed method aims to utilize the expanded objects to guide the network to obtain complete predictions.
arXiv Detail & Related papers (2023-01-18T11:00:45Z) - Instance-Aware Observer Network for Out-of-Distribution Object
Segmentation [94.73449180972239]
We extend the approach of ObsNet by harnessing an instance-wise mask prediction.
We show that our proposed method accurately disentangles in-distribution objects from Out-Of-Distribution objects on three datasets.
arXiv Detail & Related papers (2022-07-18T17:38:40Z) - Spatial Commonsense Graph for Object Localisation in Partial Scenes [36.47035776975184]
We solve object localisation in partial scenes, a new problem of estimating the unknown position of an object given a partial 3D scan of a scene.
The proposed solution is based on a novel scene graph model, the Spatial Commonsense Graph (SCG), where objects are the nodes and edges define pairwise distances between them.
The SCG is used to estimate the unknown position of the target object in two steps: first, we feed the SCG into a novel Proximity Prediction Network, a graph neural network that uses attention to perform distance prediction between the node representing the target object and the nodes representing the observed objects in the SCG.
arXiv Detail & Related papers (2022-03-10T14:13:35Z) - GOO: A Dataset for Gaze Object Prediction in Retail Environments [11.280648029091537]
We present a new task called gaze object prediction.
The goal is to predict a bounding box for a person's gazed-at object.
To train and evaluate gaze networks on this task, we present the Gaze On Objects dataset.
arXiv Detail & Related papers (2021-05-22T18:55:35Z) - Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS-17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z) - Graph Attention Tracking [76.19829750144564]
We propose a simple target-aware Siamese graph attention network for general object tracking.
Experiments on challenging benchmarks including GOT-10k, UAV123, OTB-100 and LaSOT demonstrate that the proposed SiamGAT outperforms many state-of-the-art trackers.
arXiv Detail & Related papers (2020-11-23T04:26:45Z) - Slender Object Detection: Diagnoses and Improvements [74.40792217534]
In this paper, we are concerned with the detection of a particular type of objects with extreme aspect ratios, namely slender objects.
For a classical object detection method, a drastic drop of 18.9% mAP on COCO is observed when evaluated solely on slender objects.
arXiv Detail & Related papers (2020-11-17T09:39:42Z) - Geometry Constrained Weakly Supervised Object Localization [55.17224813345206]
We propose a geometry constrained network, termed GC-Net, for weakly supervised object localization.
The detector predicts the object location defined by a set of coefficients describing a geometric shape.
The generator takes the resulting masked images as input and performs two complementary classification tasks for the object and background.
In contrast to previous approaches, GC-Net is trained end-to-end and predicts object locations without any post-processing.
arXiv Detail & Related papers (2020-07-19T17:33:42Z)