GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction
- URL: http://arxiv.org/abs/2510.25301v1
- Date: Wed, 29 Oct 2025 09:14:07 GMT
- Title: GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction
- Authors: Yang Jin, Guangyu Guo, Binglu Wang
- Abstract summary: GaTector+ is a unified framework for gaze object detection and gaze following. We first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism is proposed to fuse the scene feature and gaze feature with the help of the head location.
- Score: 25.92263916002385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods solve these two tasks separately, and their predictions of gaze objects and gaze following typically depend on head-related prior knowledge during both the training phase and real-world deployment. This dependency necessitates an auxiliary network to extract head location, thus precluding joint optimization across the entire system and constraining practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following that eliminates the dependence on head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor: a shared backbone extracts general features for gaze following and object detection, while specific blocks before and after the backbone better account for the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism is proposed to fuse the scene feature and gaze feature with the help of the head location. Since suboptimal learning of the gaze point heatmap leads to a performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap. Finally, we propose a novel evaluation metric, mean Similarity over Candidates (mSoC), for gaze object detection, which is more sensitive to variations between bounding boxes. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following tasks.
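The head-based attention step described in the abstract can be illustrated with a minimal sketch. This is not the GaTector+ implementation (its attention is learned end-to-end and its feature layout is not specified here); instead, a fixed Gaussian map centered on the detected head box stands in for the head-conditioned attention, and the function names and shapes are assumptions for illustration only.

```python
import numpy as np

def head_attention_fuse(scene_feat, gaze_feat, head_box, sigma=0.25):
    """Fuse scene and gaze features with a head-centered attention map.

    Illustrative sketch only: in GaTector+ the attention is learned; here a
    fixed Gaussian around the detected head center stands in for it.
    scene_feat, gaze_feat: (C, H, W) arrays.
    head_box: (x1, y1, x2, y2) in normalized [0, 1] image coordinates.
    """
    c, h, w = scene_feat.shape
    # Head center in normalized coordinates.
    cx = (head_box[0] + head_box[2]) / 2.0
    cy = (head_box[1] + head_box[3]) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs / max(w - 1, 1)
    ys = ys / max(h - 1, 1)
    # Gaussian attention peaking at the head location.
    attn = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    # Weight the scene feature by the head-conditioned attention, then fuse.
    return gaze_feat + scene_feat * attn[None, :, :]

scene = np.random.rand(8, 16, 16)
gaze = np.random.rand(8, 16, 16)
fused = head_attention_fuse(scene, gaze, (0.4, 0.1, 0.6, 0.3))
assert fused.shape == (8, 16, 16)
```

The design point the abstract makes is that the head box comes from the model's own head detection branch rather than an external detector, so a step like this can be trained jointly with the rest of the network.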
Related papers
- Revisiting Salient Object Detection from an Observer-Centric Perspective [48.99721284788945]
We propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only the visual cues but also observer-specific factors such as their preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction.
arXiv Detail & Related papers (2026-02-06T03:53:01Z) - CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation [51.25997439181537]
CoPRS bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder.
arXiv Detail & Related papers (2025-10-13T09:07:54Z) - Differential Contrastive Training for Gaze Estimation [24.53837441433775]
We propose a novel Differential Contrastive Training strategy, which boosts gaze estimation performance with the help of CLIP. A Differential Contrastive Gaze Estimation network (DCGaze), composed of a Visual Appearance-aware branch and a Semantic Differential-aware branch, is introduced.
arXiv Detail & Related papers (2025-02-27T14:23:20Z) - Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders [33.26237143983192]
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. We propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder.
arXiv Detail & Related papers (2024-12-12T18:55:30Z) - Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model [19.800353299691277]
This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask corresponding to the object captured by human gaze behavior.
We propose to automatically obtain head features from scene features to ensure the model's inference efficiency and flexibility in the real world.
arXiv Detail & Related papers (2024-08-02T06:32:45Z) - GazeHTA: End-to-end Gaze Target Detection with Head-Target Association [12.38704128536528]
We propose an end-to-end approach for gaze target detection. GazeHTA predicts a head-target connection between individuals and the target image regions they are looking at. Our experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods.
arXiv Detail & Related papers (2024-04-16T16:51:27Z) - Joint Gaze-Location and Gaze-Object Detection [62.69261709635086]
Current approaches frame gaze location detection (GL-D) and gaze object detection (GO-D) as two separate tasks.
We propose GTR, short for Gaze following detection TRansformer, to streamline the gaze following detection pipeline.
GTR achieves a 12.1 mAP gain on GazeFollowing and an 18.2 mAP gain on VideoAttentionTarget for GL-D, as well as a 19 mAP improvement on GOO-Real for GO-D.
arXiv Detail & Related papers (2023-08-26T12:12:24Z) - Active Gaze Control for Foveal Scene Exploration [124.11737060344052]
We propose a methodology to emulate how humans and robots with foveal cameras would explore a scene.
The proposed method achieves an increase in detection F1-score of 2-3 percentage points for the same number of gaze shifts.
arXiv Detail & Related papers (2022-08-24T14:59:28Z) - DisenHCN: Disentangled Hypergraph Convolutional Networks for Spatiotemporal Activity Prediction [53.76601630407521]
We propose a hypergraph network model called DisenHCN to bridge the gaps in existing solutions.
In particular, we first unify fine-grained user similarity and the complex matching between user preferences and temporal activity into a heterogeneous hypergraph.
We then disentangle the user representations into different aspects (location-aware, time-aware, and activity-aware) and aggregate corresponding aspect's features on the constructed hypergraph.
arXiv Detail & Related papers (2022-08-14T06:51:54Z) - End-to-End Human-Gaze-Target Detection with Transformers [57.00864538284686]
We propose an effective and efficient method for Human-Gaze-Target (HGT) detection, i.e., gaze following.
Our method, named Human-Gaze-Target detection TRansformer or HGTTR, streamlines the HGT detection pipeline by eliminating all other components.
The effectiveness and robustness of our proposed method are verified with extensive experiments on the two standard benchmark datasets, GazeFollowing and VideoAttentionTarget.
arXiv Detail & Related papers (2022-03-20T02:37:06Z) - Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding [93.82542533426766]
We propose a Suspected Object Transformation mechanism (SOT) to encourage the target object selection among the suspected ones.
SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders.
Extensive experiments demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2022-03-10T06:41:07Z) - GaTector: A Unified Framework for Gaze Object Prediction [11.456242421204298]
We build a novel framework named GaTector to tackle the gaze object prediction problem in a unified way.
To better consider the specificity of inputs and tasks, GaTector introduces two input-specific blocks before the shared backbone and three task-specific blocks after the shared backbone.
In the end, we propose a novel wUoC metric that can reveal the difference between boxes even when they share no overlapping area.
arXiv Detail & Related papers (2021-12-07T07:50:03Z)
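GaTector's wUoC and GaTector+'s mSoC are both motivated by the same shortcoming of plain IoU: it is zero for any pair of non-overlapping boxes, so it cannot distinguish a near miss from a wild miss. The exact wUoC/mSoC formulas are defined in the papers; the sketch below instead uses the well-known Generalized IoU (GIoU), which shares this key property, to illustrate the idea.

```python
def giou(a, b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2).

    Not the wUoC/mSoC formula from the papers; GIoU is shown only because it
    shares their key property: unlike plain IoU, it still varies with the
    distance between two non-overlapping boxes.
    """
    # Intersection area (zero when the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both a and b.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    closure = cw * ch
    # Penalize the empty part of the enclosing box.
    return inter / union - (closure - union) / closure

# Plain IoU is 0 for both pairs below; GIoU still tells them apart.
near = giou((0, 0, 1, 1), (1.1, 0, 2.1, 1))
far = giou((0, 0, 1, 1), (5, 0, 6, 1))
assert near > far
```

A metric with this behavior gives gaze object detectors a graded training and evaluation signal even when the predicted box misses the ground-truth object entirely.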
This list is automatically generated from the titles and abstracts of the papers on this site.