Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation
- URL: http://arxiv.org/abs/2512.17673v1
- Date: Fri, 19 Dec 2025 15:15:58 GMT
- Title: Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation
- Authors: Alexandre Personnic, Mihai Bâce,
- Abstract summary: Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules. We show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation.
- Score: 50.05866669110754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited not only by the feature representations within a frame but also by those between frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing intra-frame context to be captured and then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into the model performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards more robust video-based gaze estimation using commonly available cameras.
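The abstract describes the pipeline only at a high level: per-crop CNN features, channel attention and self-attention to fuse eye and face features, and a recurrence over the resulting spatial sequence so intra-frame context is carried across frames rather than pooled away early. The sketch below is a minimal PyTorch illustration of that idea; the module names, layer sizes, backbone, and the choice of a per-position GRU are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of a fuse-then-recur gaze pipeline (assumed layout, not the
# official ST-Gaze code): CNN stem per crop, channel attention, self-attention
# over eye + face tokens, then a GRU applied per spatial position over time.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel re-weighting (assumed variant)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # global average pool -> (B, C)
        return x * w[:, :, None, None]


class STGazeSketch(nn.Module):
    def __init__(self, channels: int = 64, hidden: int = 128):
        super().__init__()
        # Lightweight CNN stem standing in for the real backbone (assumption).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.channel_attn = ChannelAttention(channels)
        # Self-attention over the concatenated eye + face spatial tokens.
        self.fuse_attn = nn.MultiheadAttention(channels, num_heads=4,
                                               batch_first=True)
        # Recurrence over time, applied per spatial position so intra-frame
        # context survives until the very end (no premature pooling).
        self.temporal = nn.GRU(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)       # pitch / yaw gaze angles

    def encode_frame(self, eyes, face):
        feats = []
        for crop in (eyes, face):              # crop: (B, 3, H, W)
            f = self.channel_attn(self.backbone(crop))
            feats.append(f.flatten(2).transpose(1, 2))  # (B, H'*W', C)
        tokens = torch.cat(feats, dim=1)       # spatial sequence of both crops
        fused, _ = self.fuse_attn(tokens, tokens, tokens)
        return fused                           # (B, N_tokens, C)

    def forward(self, eyes_seq, face_seq):
        # eyes_seq, face_seq: (B, T, 3, H, W)
        B, T = eyes_seq.shape[:2]
        hidden, out = None, None
        for t in range(T):
            tokens = self.encode_frame(eyes_seq[:, t], face_seq[:, t])
            N, C = tokens.shape[1:]
            # Fold positions into the batch so the GRU state carries
            # per-position context from frame to frame.
            step = tokens.reshape(B * N, 1, C)
            if hidden is None:
                hidden = torch.zeros(1, B * N, self.temporal.hidden_size)
            _, hidden = self.temporal(step, hidden)
            out = hidden[-1].reshape(B, N, -1).mean(dim=1)  # pool only at the end
        return self.head(out)                  # (B, 2) gaze for the last frame


# Example: a batch of 2 clips, 4 frames, 64x64 eye and face crops.
model = STGazeSketch()
gaze = model(torch.randn(2, 4, 3, 64, 64), torch.randn(2, 4, 3, 64, 64))
print(gaze.shape)  # torch.Size([2, 2])
```

The design choice the sketch tries to reflect is the abstract's ablation finding: spatial tokens are only averaged after the temporal recurrence, whereas a "premature pooling" baseline would collapse each frame to a single vector before the GRU.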
Related papers
- DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment [17.85550556489256]
This paper proposes Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA). A Video-Based Temporal CLIP module is proposed to explicitly model temporal dynamics and enhance motion perception, aligning with the dorsal stream. A Temporal Context Module is developed to refine inter-frame dependencies, further improving motion modeling. Finally, a text-guided adaptive fusion strategy is proposed to enable more effective integration of spatial and temporal information.
arXiv Detail & Related papers (2025-04-16T03:20:28Z) - SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion [0.18416014644193066]
The SRVP model integrates standard attention (SA) and reinforced attention (RFA). Experiments on three benchmark datasets demonstrate that SRVP mitigates image quality degradation in RNN-based models.
arXiv Detail & Related papers (2025-04-10T07:36:50Z) - STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video [7.345621536750547]
This paper presents the Spatio-Temporal GraphFormer framework (STGFormer) for 3D human pose estimation in videos. First, we introduce an STG attention mechanism, designed to more effectively leverage the inherent graph distributions of the human body. Next, we present a Modulated Hop-wise Regular GCN to independently process temporal and spatial dimensions in parallel. Finally, we demonstrate our method's state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets.
arXiv Detail & Related papers (2024-07-14T06:45:27Z) - FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z) - Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
arXiv Detail & Related papers (2021-07-15T09:58:15Z) - Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.