Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model
- URL: http://arxiv.org/abs/2408.01044v1
- Date: Fri, 2 Aug 2024 06:32:45 GMT
- Title: Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model
- Authors: Yang Jin, Lei Zhang, Shi Yan, Bin Fan, Binglu Wang
- Abstract summary: This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask corresponding to the object captured by human gaze behavior.
We propose to automatically obtain head features from scene features to ensure the model's inference efficiency and flexibility in the real world.
- Score: 19.800353299691277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gaze object prediction (GOP) aims to predict the category and location of the object that a human is looking at. Previous methods utilized box-level supervision to identify the object that a person is looking at, but struggled with semantic ambiguity, i.e., a single box may contain several items since objects are close together. The vision foundation model (VFM) has improved in object segmentation using box prompts, which can reduce confusion by more precisely locating objects, offering advantages for fine-grained prediction of gaze objects. This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask corresponding to the object captured by human gaze behavior. In particular, we propose that the pixel-level supervision provided by VFM can be integrated into gaze object prediction to mitigate semantic ambiguity. This leads to our gaze object detection and segmentation framework that enables accurate pixel-level predictions. Different from previous methods that require additional head input or ignore head features, we propose to automatically obtain head features from scene features to ensure the model's inference efficiency and flexibility in the real world. Moreover, rather than directly fusing features to predict a gaze heatmap as in existing methods, which may overlook the spatial location and subtle details of the object, we develop a space-to-object gaze regression method to facilitate human-object gaze interaction. Specifically, it first constructs an initial human-object spatial connection, then refines this connection by interacting with semantically clear features in the segmentation branch, ultimately predicting a gaze heatmap for precise localization. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate the effectiveness of our method.
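The semantic ambiguity described in the abstract can be illustrated with a toy example. The sketch below is not the paper's method; it is a minimal illustration, assuming two hypothetical objects (`item_a`, `item_b`) on a 12x12 grid and a hypothetical gaze point, of why a point-in-box test can return several candidates where a point-in-mask test returns one. All names and coordinates are made up for illustration.

```python
import numpy as np

def mask_to_box(mask):
    """Return the tight bounding box (x0, y0, x1, y1), inclusive, of a boolean mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def candidates_by_box(masks, point):
    """Names of objects whose bounding box contains the gaze point (x, y)."""
    x, y = point
    hits = []
    for name, mask in masks.items():
        x0, y0, x1, y1 = mask_to_box(mask)
        if x0 <= x <= x1 and y0 <= y <= y1:
            hits.append(name)
    return hits

def candidates_by_mask(masks, point):
    """Names of objects whose pixel mask contains the gaze point (x, y)."""
    x, y = point
    return [name for name, mask in masks.items() if mask[y, x]]

def make_toy_scene():
    """Build two hypothetical objects: an L-shaped item and a block inside its box."""
    item_a = np.zeros((12, 12), dtype=bool)
    item_a[2:10, 2:4] = True   # vertical stroke of the L
    item_a[8:10, 2:10] = True  # horizontal stroke of the L
    item_b = np.zeros((12, 12), dtype=bool)
    item_b[3:7, 5:9] = True    # block sitting in the empty corner of the L's box
    gaze_point = (6, 4)        # hypothetical (x, y) gaze location, on item_b
    return {"item_a": item_a, "item_b": item_b}, gaze_point

if __name__ == "__main__":
    masks, gaze = make_toy_scene()
    print("box test: ", candidates_by_box(masks, gaze))   # both items match
    print("mask test:", candidates_by_mask(masks, gaze))  # only item_b matches
```

In this toy scene the box of `item_a` covers the gaze point even though none of its pixels do, so only the mask test singles out `item_b`.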
Related papers
- Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection [57.883265488038134]
We propose a hierarchical graph interaction network termed HGINet for camouflaged object detection.
The network is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features.
Our experiments demonstrate the superior performance of HGINet compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-27T12:53:25Z)
- TransGOP: Transformer-Based Gaze Object Prediction [27.178785186892203]
This paper introduces the Transformer into the field of gaze object prediction.
It proposes an end-to-end Transformer-based gaze object prediction method named TransGOP.
arXiv Detail & Related papers (2024-02-21T07:17:10Z)
- PointOBB: Learning Oriented Object Detection via Single Point Supervision [55.88982271340328]
This paper proposes PointOBB, the first single Point-based OBB generation method, for oriented object detection.
PointOBB operates through the collaborative utilization of three distinctive views: an original view, a resized view, and a rotated/flipped (rot/flp) view.
Experimental results on the DIOR-R and DOTA-v1.0 datasets demonstrate that PointOBB achieves promising performance.
arXiv Detail & Related papers (2023-11-23T15:51:50Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- HOKEM: Human and Object Keypoint-based Extension Module for Human-Object Interaction Detection [1.2183405753834557]
This paper presents the human and object keypoint-based extension module (HOKEM) as an easy-to-use extension module to improve the accuracy of the conventional detection models.
Experiments using the HOI dataset, V-COCO, showed that HOKEM boosted the accuracy of an appearance-based model by a large margin.
arXiv Detail & Related papers (2023-06-25T14:40:26Z)
- Sharp Eyes: A Salient Object Detector Working The Same Way as Human Visual Characteristics [3.222802562733787]
We propose a sharp eyes network (SENet) that first separates the object from the scene, and then finely segments it.
The proposed method aims to utilize the expanded objects to guide the network to obtain a complete prediction.
arXiv Detail & Related papers (2023-01-18T11:00:45Z)
- Object Detection in Aerial Images with Uncertainty-Aware Graph Network [61.02591506040606]
We propose a novel uncertainty-aware object detection framework with a structured-graph, where nodes and edges are denoted by objects.
We refer to our model as Uncertainty-Aware Graph network for object DETection (UAGDet).
arXiv Detail & Related papers (2022-08-23T07:29:03Z)
- GaTector: A Unified Framework for Gaze Object Prediction [11.456242421204298]
We build a novel framework named GaTector to tackle the gaze object prediction problem in a unified way.
To better consider the specificity of inputs and tasks, GaTector introduces two input-specific blocks before the shared backbone and three task-specific blocks after the shared backbone.
In the end, we propose a novel wUoC metric that can reveal the difference between boxes even when they share no overlapping area.
arXiv Detail & Related papers (2021-12-07T07:50:03Z)
- GOO: A Dataset for Gaze Object Prediction in Retail Environments [11.280648029091537]
We present a new task called gaze object prediction.
The goal is to predict a bounding box for a person's gazed-at object.
To train and evaluate gaze networks on this task, we present the Gaze On Objects dataset.
arXiv Detail & Related papers (2021-05-22T18:55:35Z)
- Personal Fixations-Based Object Segmentation with Object Localization and Boundary Preservation [60.41628937597989]
We focus on Personal Fixations-based Object Segmentation (PFOS) to address issues in previous studies.
We propose a novel network based on Object Localization and Boundary Preservation (OLBP) to segment the gazed objects.
OLBP is organized in a mixed bottom-up and top-down manner with multiple types of deep supervision.
arXiv Detail & Related papers (2021-01-22T09:20:47Z)
- Slender Object Detection: Diagnoses and Improvements [74.40792217534]
In this paper, we are concerned with the detection of a particular type of object with extreme aspect ratios, namely slender objects.
For a classical object detection method, a drastic drop of $18.9\%$ mAP on COCO is observed if it is evaluated solely on slender objects.
arXiv Detail & Related papers (2020-11-17T09:39:42Z)