Related papers: Sharingan: A Transformer-based Architecture for Gaze Following

URL: http://arxiv.org/abs/2310.00816v1
Date: Sun, 1 Oct 2023 23:14:54 GMT
Title: Sharingan: A Transformer-based Architecture for Gaze Following
Authors: Samy Tafasca, Anshul Gupta, Jean-Marc Odobez
Abstract summary: We introduce a novel transformer-based architecture for 2D gaze prediction. It achieves state-of-the-art results on the GazeFollow and VideoAttentionTarget datasets.
Score: 14.594691605523005
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Gaze is a powerful form of non-verbal communication and social interaction that humans develop from an early age. As such, modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular, Gaze Following is defined as the prediction of the pixel-wise 2D location where a person in the image is looking. Prior efforts in this direction have focused primarily on CNN-based architectures to perform the task. In this paper, we introduce a novel transformer-based architecture for 2D gaze prediction. We experiment with 2 variants: the first one retains the same task formulation of predicting a gaze heatmap for one person at a time, while the second one casts the problem as a 2D point regression and allows us to perform multi-person gaze prediction with a single forward pass. This new architecture achieves state-of-the-art results on the GazeFollow and VideoAttentionTarget datasets. The code for this paper will be made publicly available.
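The abstract contrasts two output formulations: a per-person gaze heatmap and a direct 2D point. As a rough illustration of how the two relate, the sketch below decodes a heatmap into a normalized 2D point by taking its peak. It is not the authors' code; the function name `heatmap_to_point`, the nested-list heatmap, and the cell-center convention are all assumptions made for this example.

```python
from typing import List, Tuple


def heatmap_to_point(heatmap: List[List[float]]) -> Tuple[float, float]:
    """Return the (x, y) location of the heatmap peak, normalized to [0, 1].

    Hypothetical helper: assumes a rectangular, non-empty grid of scores
    indexed as heatmap[row][col], with row 0 at the top of the image.
    """
    if not heatmap or not heatmap[0]:
        raise ValueError("heatmap must be a non-empty 2D grid")
    height = len(heatmap)
    width = len(heatmap[0])
    # Find the cell with the highest score (the first one wins on ties).
    best_row, best_col = max(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    # Report the cell center so the result does not depend on resolution.
    return ((best_col + 0.5) / width, (best_row + 0.5) / height)


# A toy 4x4 heatmap whose peak sits at row 1, column 2.
example_heatmap = [
    [0.0, 0.1, 0.2, 0.0],
    [0.1, 0.3, 0.9, 0.2],
    [0.0, 0.2, 0.4, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
print(heatmap_to_point(example_heatmap))
```

A point-regression model would skip this decoding step and emit such (x, y) pairs directly, one per person, which is what makes a single forward pass for several people possible.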

Related papers

Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation [53.09168514034483]
Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. We propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap.
arXiv Detail & Related papers (2026-02-27T08:54:20Z)
GazeDETR: Gaze Detection using Disentangled Head and Gaze Representations [14.82916312780764]
We propose GazeDETR, a novel end-to-end architecture with two disentangled decoders. Our proposed architecture achieves state-of-the-art results on the GazeFollow, VideoAttentionTarget and ChildPlay datasets.
arXiv Detail & Related papers (2025-08-18T14:41:18Z)
3D Prior is All You Need: Cross-Task Few-shot 2D Gaze Estimation [27.51272922798475]
We introduce a novel cross-task 2D gaze estimation approach, aiming to adapt a pre-trained 3D gaze estimation network for 2D gaze prediction on unseen devices. This task is highly challenging due to the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data. We evaluate our method on MPIIGaze, EVE, and GazeCapture datasets, collected respectively on laptops, desktop computers, and mobile devices.
arXiv Detail & Related papers (2025-02-06T13:37:09Z)
Towards Robust and Realistic Human Pose Estimation via WiFi Signals [85.60557095666934]
WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) cross-domain gap, i.e., due to significant variations between source-target domain pose distributions; and 2) structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding.
arXiv Detail & Related papers (2025-01-16T09:38:22Z)
Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior. Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
arXiv Detail & Related papers (2023-01-03T08:52:49Z)
BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training [0.304585143845864]
In this paper, we consider the recently proposed Bottleneck Transformers, which combine CNN and multi-head self attention (MHSA) layers effectively. We consider different backbone architectures and pre-train them using the DINO self-supervised learning method. Experiments show that our model achieves an AP of 76.4, which is competitive with other methods such as [1] and has fewer network parameters.
arXiv Detail & Related papers (2022-04-21T15:45:05Z)
L2CS-Net: Fine-Grained Gaze Estimation in Unconstrained Environments [2.5234156040689237]
We propose a robust CNN-based model for predicting gaze in unconstrained settings. We use two identical losses, one for each angle, to improve network learning and increase its generalization. Our proposed model achieves state-of-the-art accuracy of 3.92° and 10.41° on MPIIGaze and Gaze360 datasets, respectively.
arXiv Detail & Related papers (2022-03-07T12:35:39Z)
A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction [1.1816942730023883]
We introduce a deep graph autoencoder to jointly learn recognition and prediction of manipulation tasks from symbolic scene graphs. Our network has a variational autoencoder structure with two branches: one for identifying the input graph type and one for predicting the future graphs. We benchmark our new model against different state-of-the-art methods on two different datasets, MANIAC and MSRC-9, and show that our proposed model can achieve better performance.
arXiv Detail & Related papers (2021-10-25T21:40:42Z)
Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images [79.70127290464514]
We decompose the task into two stages, i.e. person localization and pose estimation, and propose three task-specific graph neural networks for effective message passing. Our approach achieves state-of-the-art performance on CMU Panoptic and Shelf datasets.
arXiv Detail & Related papers (2021-09-13T11:44:07Z)
GPRAR: Graph Convolutional Network based Pose Reconstruction and Action Recognition for Human Trajectory Prediction [1.2891210250935146]
Existing prediction models are prone to errors in real-world settings where observations are often noisy. We introduce GPRAR, a graph convolutional network based pose reconstruction and action recognition for human trajectory prediction. We show that GPRAR improves the prediction accuracy by up to 22% and 50% under noisy observations on JAAD and TITAN datasets.
arXiv Detail & Related papers (2021-03-25T20:12:14Z)
3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos. Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure. We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
End-to-end Contextual Perception and Prediction with Interaction Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving. To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture. Our model can be trained end-to-end, and runs in real-time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z)
Socially and Contextually Aware Human Motion and Pose Forecasting [48.083060946226]
We propose a novel framework to tackle both tasks of human motion (or trajectory) and body skeleton pose forecasting. We consider incorporating both scene and social contexts as critical clues for this prediction task. Our proposed framework achieves superior performance compared to several baselines on two social datasets.
arXiv Detail & Related papers (2020-07-14T06:12:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.