GazeHTA: End-to-end Gaze Target Detection with Head-Target Association
- URL: http://arxiv.org/abs/2404.10718v2
- Date: Fri, 19 Apr 2024 01:19:25 GMT
- Title: GazeHTA: End-to-end Gaze Target Detection with Head-Target Association
- Authors: Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, Xucong Zhang
- Abstract summary: We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at.
Most existing methods either rely on independent components such as off-the-shelf head detectors or struggle to establish associations between heads and gaze targets.
We investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on the input scene image.
- Score: 12.38704128536528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most existing methods either rely on independent components such as off-the-shelf head detectors or struggle to establish associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on the input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual association between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.
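As a rough illustration of the three components above, the PyTorch-style sketch below mirrors the described data flow: scene features (here a placeholder encoder standing in for the pre-trained diffusion backbone), a head-feature branch whose output is re-injected into the scene features, and per-instance head heatmaps, gaze-target heatmaps, and connection maps. All module names, shapes, and layer choices are assumptions made for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GazeHTASketch(nn.Module):
    """Illustrative data flow only; modules and shapes are assumptions,
    not the released GazeHTA code."""

    def __init__(self, feat_dim=256, num_instances=20):
        super().__init__()
        # Stand-in for features from a pre-trained diffusion backbone;
        # here just a small conv encoder placeholder.
        self.scene_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Head branch: produces a head feature map that is re-injected into
        # the scene features to strengthen head priors.
        self.head_branch = nn.Conv2d(feat_dim, feat_dim, 3, padding=1)
        # Per-instance predictors: head heatmap, gaze-target heatmap, and a
        # connection map that visually links each head to its target.
        self.head_heatmaps = nn.Conv2d(feat_dim, num_instances, 1)
        self.target_heatmaps = nn.Conv2d(feat_dim, num_instances, 1)
        self.connection_maps = nn.Conv2d(feat_dim, num_instances, 1)

    def forward(self, image):
        feats = self.scene_encoder(image)        # (B, C, H', W')
        feats = feats + self.head_branch(feats)  # head-feature re-injection
        return {
            "head_heatmaps": self.head_heatmaps(feats).sigmoid(),
            "target_heatmaps": self.target_heatmaps(feats).sigmoid(),
            "connection_maps": self.connection_maps(feats).sigmoid(),
        }

# One forward pass yields per-instance head/target heatmaps plus the
# connection maps that associate them.
model = GazeHTASketch()
out = model(torch.randn(1, 3, 512, 512))
print({k: v.shape for k, v in out.items()})
```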
Related papers
- A Review of Human-Object Interaction Detection [6.1941885271010175]
Human-object interaction (HOI) detection plays a key role in high-level visual understanding.
This paper systematically summarizes and discusses the recent work in image-based HOI detection.
arXiv Detail & Related papers (2024-08-20T08:32:39Z)
- UnionDet: Union-Level Detector Towards Real-Time Human-Object Interaction Detection [35.2385914946471]
We propose a one-stage meta-architecture for HOI detection powered by a novel union-level detector.
Our one-stage detector for human-object interaction achieves a significant (4x to 14x) reduction in interaction prediction time.
arXiv Detail & Related papers (2023-12-19T23:34:43Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Object-aware Gaze Target Detection [14.587595325977583]
This paper proposes a Transformer-based architecture that automatically detects objects in the scene to build associations between every head and the gazed-head/object.
Our method achieves state-of-the-art results on all metrics for gaze target detection and an 11-13% improvement in average precision for the classification and localization of gazed objects.
arXiv Detail & Related papers (2023-07-18T22:04:41Z)
- CLIP the Gap: A Single Domain Generalization Approach for Object Detection [60.20931827772482]
Single Domain Generalization tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain.
We propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts.
We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss.
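A minimal sketch of the general idea (shifting backbone features toward text-prompted domain concepts before the classification loss) might look like the following; the prompts, projection layer, and mixing weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SemanticAugment(nn.Module):
    """Hypothetical text-driven semantic augmentation: perturb detector
    backbone features toward textual domain concepts. Prompts, projection,
    and alpha are assumptions for illustration only."""

    def __init__(self, text_dim=512, feat_dim=256, num_prompts=4, alpha=0.1):
        super().__init__()
        # Stand-in for frozen text embeddings of prompts such as
        # "a photo taken at night", "a photo in fog", ...
        self.text_embeds = nn.Parameter(torch.randn(num_prompts, text_dim),
                                        requires_grad=False)
        self.project = nn.Linear(text_dim, feat_dim)  # text space -> feature space
        self.alpha = alpha

    def forward(self, feats):  # feats: (B, C, H, W) backbone features
        b = feats.size(0)
        idx = torch.randint(0, self.text_embeds.size(0), (b,))
        shift = self.project(self.text_embeds[idx])   # (B, C)
        return feats + self.alpha * shift[:, :, None, None]

# Example: augment a batch of backbone feature maps.
aug = SemanticAugment()
out = aug(torch.randn(2, 256, 32, 32))
```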
arXiv Detail & Related papers (2023-01-13T12:01:18Z)
- Active Gaze Control for Foveal Scene Exploration [124.11737060344052]
We propose a methodology to emulate how humans and robots with foveal cameras would explore a scene.
The proposed method achieves an increase in detection F1-score of 2-3 percentage points for the same number of gaze shifts.
arXiv Detail & Related papers (2022-08-24T14:59:28Z)
- Knowledge Guided Bidirectional Attention Network for Human-Object Interaction Detection [3.0915392100355192]
We argue that the independent use of the bottom-up parsing strategy in HOI is counter-intuitive and could lead to the diffusion of attention.
We introduce a novel knowledge-guided top-down attention into HOI, and propose to model the relation parsing as a "look and search" process.
We implement the process via unifying the bottom-up and top-down attention in a single encoder-decoder based model.
arXiv Detail & Related papers (2022-07-16T16:42:49Z)
- End-to-End Human-Gaze-Target Detection with Transformers [57.00864538284686]
We propose an effective and efficient method for Human-Gaze-Target (HGT) detection, i.e., gaze following.
Our method, named Human-Gaze-Target detection TRansformer or HGTTR, streamlines the HGT detection pipeline by eliminating all other components.
The effectiveness and robustness of our proposed method are verified with extensive experiments on the two standard benchmark datasets, GazeFollowing and VideoAttentionTarget.
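For readers unfamiliar with query-based detection, a minimal sketch of an end-to-end, transformer-query formulation that jointly predicts head boxes and gaze heatmaps is given below; all layer sizes, output parameterizations, and module names are assumptions for illustration and do not reproduce HGTTR's actual architecture.

```python
import torch
import torch.nn as nn

class QueryGazeDetector(nn.Module):
    """Illustrative query-based human-gaze-target detector; layer sizes and
    output heads are assumptions, not HGTTR's implementation."""

    def __init__(self, d_model=256, num_queries=20, heatmap_size=64):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, 16, stride=16)   # patchify stand-in
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=3)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.head_box = nn.Linear(d_model, 4)                      # (cx, cy, w, h)
        self.gaze_heatmap = nn.Linear(d_model, heatmap_size ** 2)  # flattened heatmap

    def forward(self, image):
        mem = self.backbone(image).flatten(2).transpose(1, 2)      # (B, HW, C)
        q = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        hs = self.decoder(q, mem)                                  # (B, Q, C)
        return self.head_box(hs).sigmoid(), self.gaze_heatmap(hs).sigmoid()

# Example: each query decodes into one candidate head box and gaze heatmap.
boxes, heatmaps = QueryGazeDetector()(torch.randn(1, 3, 512, 512))
```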
arXiv Detail & Related papers (2022-03-20T02:37:06Z)
- One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z)
- Onfocus Detection: Identifying Individual-Camera Eye Contact from Unconstrained Images [81.64699115587167]
Onfocus detection aims at identifying whether the focus of the individual captured by a camera is on the camera or not.
We build a large-scale onfocus detection dataset, named the OnFocus Detection In the Wild (OFDIW) dataset.
We propose a novel end-to-end deep model, i.e., the eye-context interaction inferring network (ECIIN) for onfocus detection.
arXiv Detail & Related papers (2021-03-29T03:29:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.