Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
- URL: http://arxiv.org/abs/2412.09586v1
- Date: Thu, 12 Dec 2024 18:55:30 GMT
- Title: Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
- Authors: Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, James M. Rehg
- Abstract summary: We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene.
We propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder.
- Score: 33.26237143983192
- Abstract: We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: http://github.com/fkryan/gazelle .
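As a rough sketch of the pipeline the abstract describes (a frozen DINOv2 scene encoder, a single shared feature map, a person-specific positional prompt, and a lightweight decoder), the PyTorch fragment below may help fix ideas. All module names, dimensions, and the choice to inject the prompt as a learned embedding added at head-box tokens are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class GazeLLESketch(nn.Module):
    """Minimal sketch: frozen scene encoder + positional prompt + light decoder.
    Hypothetical names and dimensions; not the authors' released code."""

    def __init__(self, backbone, backbone_dim=768, dim=256, grid=16):
        super().__init__()
        self.backbone = backbone                    # DINOv2-style ViT returning patch tokens
        for p in self.backbone.parameters():        # keep the encoder frozen
            p.requires_grad = False
        self.proj = nn.Linear(backbone_dim, dim)    # project frozen features
        self.head_prompt = nn.Parameter(torch.zeros(dim))  # learned per-person prompt
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=3)  # lightweight module
        self.to_heatmap = nn.Linear(dim, 1)         # per-token gaze logit
        self.grid = grid

    def forward(self, image, head_mask):
        # One shared scene representation for all people in the image.
        with torch.no_grad():
            feats = self.backbone(image)            # (B, N, backbone_dim), N = grid * grid
        tokens = self.proj(feats)
        # Person-specific positional prompt: add a learned embedding at the
        # patch tokens covered by this person's head bounding box.
        mask = head_mask.flatten(1).unsqueeze(-1).float()  # (B, N, 1)
        tokens = self.decoder(tokens + mask * self.head_prompt)
        heat = self.to_heatmap(tokens).squeeze(-1)  # (B, N)
        return heat.view(-1, self.grid, self.grid)  # coarse gaze heatmap
```

Because the backbone is frozen and only the small decoder is trained, one scene encoding can be reused for every person in the image; only head_mask changes per person.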
Related papers
- Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach [27.84672974344777]
We propose a novel gaze target prediction solution named GazeSeg.
It fully utilizes the person's spatial visual field as guiding information, driving a progressive, coarse-to-fine gaze target segmentation and recognition process.
Our approach achieves a Dice score of 0.325 in gaze target segmentation and 71.7% top-5 recognition accuracy.
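A coarse-to-fine decode of this kind might look like the following sketch; the field-of-view prior, scale schedule, and per-scale refinement heads are hypothetical placeholders, not GazeSeg's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def coarse_to_fine_gaze_seg(fov_prior, heads, scales=(16, 32, 64)):
    """Progressive gaze-target segmentation gated by a visual-field prior.
    fov_prior: (B, 1, H, W) soft mask of the person's visual field.
    heads: one hypothetical refinement net per scale (placeholders here);
    the scheme illustrates coarse-to-fine decoding, not GazeSeg itself."""
    pred = None
    for head, s in zip(heads, scales):
        fov = F.interpolate(fov_prior, size=(s, s), mode="bilinear", align_corners=False)
        if pred is not None:
            # Gate this scale by the previous, coarser estimate.
            fov = fov * F.interpolate(pred, size=(s, s), mode="bilinear", align_corners=False)
        pred = torch.sigmoid(head(fov))
    return pred  # (B, 1, scales[-1], scales[-1]) segmentation map

# e.g. heads = [nn.Conv2d(1, 1, 3, padding=1) for _ in range(3)]
```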
arXiv Detail & Related papers (2024-11-30T01:27:48Z)
- Stanceformer: Target-Aware Transformer for Stance Detection [59.69858080492586]
Stance Detection involves discerning the stance expressed in a text towards a specific subject or target.
Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively.
We introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference.
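The summary does not say how targets are prioritized; one simple, hypothetical realization is to add a positive bias to attention logits at target-phrase tokens, as sketched below (not the paper's actual formulation).

```python
import torch

def target_aware_attention(q, k, v, target_mask, bias=1.0):
    """Scaled dot-product attention with extra weight on target tokens.
    q, k, v: (B, T, D); target_mask: (B, T) float, 1.0 at target-phrase tokens.
    Illustrative only; not Stanceformer's exact mechanism."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5        # (B, T, T)
    logits = logits + bias * target_mask.unsqueeze(1)  # boost target keys for every query
    return torch.softmax(logits, dim=-1) @ v
```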
arXiv Detail & Related papers (2024-10-09T17:24:28Z)
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state of the art on the Argoverse 2 Sensor and Waymo Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z)
- Object-aware Gaze Target Detection [14.587595325977583]
This paper proposes a Transformer-based architecture that automatically detects objects in the scene to build associations between every head and the gazed head or object.
Our method achieves state-of-the-art results on all gaze target detection metrics, with an 11-13% improvement in average precision for classifying and localizing gazed objects.
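One minimal way to picture the head-to-target association step (the real model is a learned transformer; the dot-product scoring below is a placeholder):

```python
import torch

def associate_heads_to_targets(head_feats, candidate_feats):
    """Score every (head, candidate) pair and pick the likeliest gaze target.
    head_feats: (H, D) embeddings of detected heads; candidate_feats: (O, D)
    embeddings of detected objects and heads (people can gaze at other heads).
    Illustrative dot-product scoring, not the paper's architecture."""
    scores = head_feats @ candidate_feats.T  # (H, O) association logits
    return scores.argmax(dim=-1)             # gazed-target index per head
```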
arXiv Detail & Related papers (2023-07-18T22:04:41Z)
- LatentGaze: Cross-Domain Gaze Estimation through Gaze-Aware Analytic Latent Code Manipulation [0.0]
We propose a gaze-aware analytic manipulation method, a data-driven approach that exploits the disentanglement properties of generative adversarial network (GAN) inversion.
Using a GAN-based encoder-generator process, we shift the input image from the target domain to the source domain, with which the gaze estimator is sufficiently familiar.
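Schematically, the shift is a GAN-inversion round-trip; the encoder and generator below are placeholder interfaces, not LatentGaze's actual models.

```python
import torch

@torch.no_grad()
def shift_to_source_domain(image, encoder, generator):
    """Invert a target-domain image to a latent code, then re-synthesize it
    with a source-domain generator, so the downstream gaze estimator sees
    the distribution it was trained on. Placeholder models, for illustration."""
    w = encoder(image)    # image -> latent code (inversion)
    return generator(w)   # latent code -> source-domain image

# gaze = estimator(shift_to_source_domain(target_img, E, G))  # hypothetical use
```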
arXiv Detail & Related papers (2022-09-21T08:05:53Z)
- Active Gaze Control for Foveal Scene Exploration [124.11737060344052]
We propose a methodology to emulate how humans and robots with foveal cameras would explore a scene.
The proposed method achieves an increase in detection F1-score of 2-3 percentage points for the same number of gaze shifts.
arXiv Detail & Related papers (2022-08-24T14:59:28Z)
- GaTector: A Unified Framework for Gaze Object Prediction [11.456242421204298]
We build a novel framework named GaTector to tackle the gaze object prediction problem in a unified way.
To account for the specificity of its inputs and tasks, GaTector introduces two input-specific blocks before the shared backbone and three task-specific blocks after it.
Finally, we propose a novel wUoC metric that can reveal the difference between boxes even when they share no overlapping area.
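The wUoC definition is not given in this summary; to illustrate the stated property, the sketch below uses generalized IoU (GIoU), a standard metric that, unlike plain IoU, still distinguishes boxes that share no overlapping area.

```python
def giou(a, b):
    """Generalized IoU for boxes (x1, y1, x2, y2). A stand-in for wUoC, whose
    exact definition is not given here; like wUoC, it stays informative when
    the boxes do not overlap."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # The smallest enclosing box penalizes spatial separation.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

# Plain IoU is 0 for both pairs; GIoU still tells them apart.
print(giou((0, 0, 1, 1), (1.1, 0, 2.1, 1)))  # near miss -> about -0.05
print(giou((0, 0, 1, 1), (5, 0, 6, 1)))      # far apart -> about -0.67
```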
arXiv Detail & Related papers (2021-12-07T07:50:03Z)
- GOO: A Dataset for Gaze Object Prediction in Retail Environments [11.280648029091537]
We present a new task called gaze object prediction.
The goal is to predict a bounding box for a person's gazed-at object.
To train and evaluate gaze networks on this task, we present the Gaze On Objects dataset.
arXiv Detail & Related papers (2021-05-22T18:55:35Z)
- Weakly-Supervised Physically Unconstrained Gaze Estimation [80.66438763587904]
We tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions.
We propose a training algorithm along with several novel loss functions especially designed for the task.
We show significant improvements in (a) the accuracy of semi-supervised gaze estimation and (b) cross-domain generalization on the state-of-the-art physically unconstrained in-the-wild Gaze360 gaze estimation benchmark.
arXiv Detail & Related papers (2021-05-20T14:58:52Z)
- Towards End-to-end Video-based Eye-Tracking [50.0630362419371]
Estimating eye-gaze from images alone is a challenging task due to unobservable person-specific factors.
We propose a novel dataset and accompanying method which aims to explicitly learn these semantic and temporal relationships.
We demonstrate that fusing information from visual stimuli with eye images can achieve performance similar to figures reported in the literature.
arXiv Detail & Related papers (2020-07-26T12:39:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the generated summaries (including all information above) and is not responsible for any consequences of their use.