UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training
- URL: http://arxiv.org/abs/2502.02307v2
- Date: Thu, 13 Mar 2025 15:59:03 GMT
- Title: UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training
- Authors: Jiawei Qin, Xucong Zhang, Yusuke Sugano
- Abstract summary: We propose UniGaze, leveraging large-scale in-the-wild facial datasets for gaze estimation through self-supervised pre-training. Our experiments reveal that self-supervised approaches designed for semantic tasks fail when applied to gaze estimation. We demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data.
- Score: 12.680014448486242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite decades of research on data collection and model architectures, current gaze estimation models encounter significant challenges in generalizing across diverse data domains. Recent advances in self-supervised pre-training have shown remarkable performance in generalization across various vision tasks, yet their effectiveness in gaze estimation remains unexplored. We propose UniGaze, which, for the first time, leverages large-scale in-the-wild facial datasets for gaze estimation through self-supervised pre-training. Through systematic investigation, we clarify the critical factors that are essential for effective pre-training in gaze estimation. Our experiments reveal that self-supervised approaches designed for semantic tasks fail when applied to gaze estimation, while our carefully designed pre-training pipeline consistently improves cross-domain performance. Through comprehensive experiments covering challenging cross-dataset evaluation and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. Source code and models are available at https://github.com/ut-vision/UniGaze.
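The abstract describes a two-stage recipe: self-supervised pre-training of a backbone on unlabeled faces, then supervised fine-tuning with a lightweight gaze regressor. Below is a minimal PyTorch sketch of the fine-tuning stage and the standard angular-error metric; the encoder is a toy placeholder for the paper's pre-trained backbone, and `GazeHead`, `pitchyaw_to_vector`, and `angular_error_deg` are illustrative names, not UniGaze's API.

```python
# Minimal sketch of the fine-tuning stage: pretrained encoder + gaze head.
import torch
import torch.nn as nn

class GazeHead(nn.Module):
    """Regresses gaze as (pitch, yaw) angles from encoder features."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats)  # (B, 2): pitch and yaw in radians

def pitchyaw_to_vector(py: torch.Tensor) -> torch.Tensor:
    """Convert (pitch, yaw) angles to a unit 3D gaze direction."""
    pitch, yaw = py[:, 0], py[:, 1]
    return torch.stack([
        -torch.cos(pitch) * torch.sin(yaw),
        -torch.sin(pitch),
        -torch.cos(pitch) * torch.cos(yaw),
    ], dim=1)

def angular_error_deg(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Standard gaze metric: angle between predicted and true directions."""
    cos = nn.functional.cosine_similarity(pred, gt, dim=1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos))

# Toy stand-in for the self-supervised pre-trained backbone.
pretrained_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, 768))
head = GazeHead()
opt = torch.optim.AdamW(list(pretrained_encoder.parameters()) + list(head.parameters()), lr=1e-4)

images = torch.randn(8, 3, 224, 224)              # dummy face crops
gt_py = torch.zeros(8, 2)                         # dummy (pitch, yaw) labels, radians
pred_py = head(pretrained_encoder(images))
loss = nn.functional.l1_loss(pred_py, gt_py)      # L1 on gaze angles is a common choice
loss.backward()
opt.step()
print(angular_error_deg(pitchyaw_to_vector(pred_py), pitchyaw_to_vector(gt_py)).mean().item())
```

The (pitch, yaw)-to-vector convention and the angular error in degrees follow common practice in appearance-based gaze estimation benchmarks.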
Related papers
- GM-DF: Generalized Multi-Scenario Deepfake Detection [49.072106087564144]
Existing face forgery detection methods usually follow the paradigm of training models in a single domain.
In this paper, we systematically investigate the generalization capacity of deepfake detection models when they are jointly trained on multiple face forgery detection datasets.
arXiv Detail & Related papers (2024-06-28T17:42:08Z) - Adapting to Length Shift: FlexiLength Network for Trajectory Prediction [53.637837706712794]
Trajectory prediction plays an important role in various applications, including autonomous driving, robotics, and scene understanding.
Existing approaches mainly focus on developing compact neural networks to increase prediction precision on public datasets, typically employing a standardized input duration.
We introduce a general and effective framework, the FlexiLength Network (FLN), to enhance the robustness of existing trajectory prediction models against varying observation periods.
arXiv Detail & Related papers (2024-03-31T17:18:57Z) - CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model [13.890404285565225]
We propose a novel framework called CLIP-Gaze that leverages the transferable knowledge of a pre-trained vision-language model.
Our framework is the first to leverage a vision-and-language cross-modality approach for the gaze estimation task.
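For intuition, a hedged sketch of the cross-modality idea: freeze a pre-trained vision-language image encoder and train only a small gaze regressor on its transferable features. This uses OpenAI's `clip` package for concreteness and is not CLIP-Gaze's actual architecture.

```python
# Illustrative only: a gaze regressor on top of frozen CLIP image features.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()  # keep the vision-language backbone frozen

gaze_head = nn.Linear(512, 2).to(device)  # ViT-B/32 image features are 512-d

images = torch.randn(4, 3, 224, 224, device=device)  # dummy preprocessed face crops
with torch.no_grad():
    feats = model.encode_image(images).float()       # (4, 512) transferable features
pred_pitch_yaw = gaze_head(feats)                    # only this head is trained
```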
arXiv Detail & Related papers (2024-03-08T07:37:21Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks while significantly reducing its parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition [6.547550920819356]
ConGaze is a contrastive learning-based framework that learns generic gaze-aware representations across subjects in an unsupervised way.
We introduce gaze-specific data augmentation to preserve gaze-semantic features and maintain gaze consistency.
We also devise a novel subject-conditional projection module that encourages a shared feature extractor to learn gaze-aware and generic representations.
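A hedged sketch of the two ingredients named above, i.e. a shared feature extractor feeding subject-conditional projection heads trained with a standard NT-Xent contrastive loss over two gaze-preserving views. Module names and sizes are illustrative, not ConGaze's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubjectConditionalProjector(nn.Module):
    """One small projection head per subject over shared backbone features."""
    def __init__(self, num_subjects: int, feat_dim: int = 128, proj_dim: int = 64):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, proj_dim))
             for _ in range(num_subjects)]
        )

    def forward(self, feats: torch.Tensor, subject_ids: torch.Tensor) -> torch.Tensor:
        out = torch.stack([self.heads[int(s)](f) for f, s in zip(feats, subject_ids)])
        return F.normalize(out, dim=1)

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent contrastive loss over two augmented views of the same batch."""
    z = torch.cat([z1, z2])                            # (2B, D), already L2-normalized
    sim = z @ z.t() / tau                              # cosine similarities / temperature
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))         # a sample is not its own positive
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)

feats = torch.randn(8, 128)                  # stand-in backbone features, view 1
feats_aug = feats + 0.01 * torch.randn_like(feats)  # stand-in features of view 2
subject_ids = torch.randint(0, 4, (8,))      # which subject each sample comes from
proj = SubjectConditionalProjector(num_subjects=4)
loss = nt_xent(proj(feats, subject_ids), proj(feats_aug, subject_ids))
```

Per-subject heads absorb subject-specific appearance, which encourages the shared extractor underneath to keep only the generic, gaze-relevant signal.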
arXiv Detail & Related papers (2023-09-08T09:45:19Z) - Improving 2D Human Pose Estimation in Rare Camera Views with Synthetic Data [24.63316659365843]
We introduce RePoGen, an SMPL-based method for generating synthetic humans with comprehensive control over pose and view.
Experiments on top-view datasets and a new dataset of real images with diverse poses show that augmenting the COCO dataset with RePoGen data outperforms previous approaches.
arXiv Detail & Related papers (2023-07-13T13:17:50Z) - GEO-Bench: Toward Foundation Models for Earth Monitoring [139.77907168809085]
We propose a benchmark comprising six classification and six segmentation tasks.
This benchmark will be a driver of progress across a variety of Earth monitoring tasks.
arXiv Detail & Related papers (2023-06-06T16:16:05Z) - Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement [12.857137513211866]
We propose an effective model training pipeline consisting of a training data synthesis and a gaze estimation model for unsupervised domain adaptation.
The proposed data synthesis leverages single-image 3D reconstruction to expand the head pose range of the source domain without requiring a 3D facial shape dataset.
We propose a disentangling autoencoder network to separate gaze-related features and introduce a background augmentation consistency loss to exploit the characteristics of the synthetic source domain.
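A minimal sketch of what a background augmentation consistency loss can look like: two composites of the same synthetic face over different backgrounds should map to the same gaze-related feature. The `encoder` and compositing step below are placeholders, not the paper's networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, 128))  # toy gaze-feature encoder

def composite(face: torch.Tensor, mask: torch.Tensor, background: torch.Tensor) -> torch.Tensor:
    """Paste the masked face region onto a new background."""
    return face * mask + background * (1.0 - mask)

face = torch.rand(4, 3, 224, 224)
mask = (torch.rand(4, 1, 224, 224) > 0.5).float()  # dummy foreground mask
bg_a, bg_b = torch.rand_like(face), torch.rand_like(face)

feat_a = encoder(composite(face, mask, bg_a))
feat_b = encoder(composite(face, mask, bg_b))
consistency_loss = F.mse_loss(feat_a, feat_b)  # background must not change gaze features
```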
arXiv Detail & Related papers (2023-05-25T15:15:03Z) - Towards Precision in Appearance-based Gaze Estimation in the Wild [3.4253416336476246]
We present a large gaze estimation dataset, PARKS-Gaze, with wider head pose and illumination variation.
The proposed dataset is more challenging and enables models to generalize to unseen participants better than existing in-the-wild datasets.
arXiv Detail & Related papers (2023-02-05T10:09:35Z) - 3DGazeNet: Generalizing Gaze Estimation with Weak-Supervision from Synthetic Views [67.00931529296788]
We propose to train general gaze estimation models which can be directly employed in novel environments without adaptation.
We create a large-scale dataset of diverse faces with gaze pseudo-annotations, which we extract based on the 3D geometry of the scene.
We test our method on the task of gaze generalization, demonstrating an improvement of up to 30% over the state of the art when no ground-truth data are available.
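The core of geometry-derived pseudo-annotation is simple: given a 3D eye centre and the 3D point being fixated, the gaze label is the unit vector between them. A NumPy sketch of that idea follows; it illustrates the principle, not 3DGazeNet's exact procedure.

```python
import numpy as np

def gaze_pseudo_label(eye_center: np.ndarray, target_point: np.ndarray) -> np.ndarray:
    """Unit gaze direction from the eye towards the fixated 3D point."""
    g = target_point - eye_center
    return g / np.linalg.norm(g)

eye = np.array([0.03, -0.02, 0.55])    # eye centre, metres, camera frame (dummy)
target = np.array([0.00, 0.10, 0.00])  # fixated point, e.g. near the camera (dummy)
print(gaze_pseudo_label(eye, target))  # pseudo ground-truth gaze vector
```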
arXiv Detail & Related papers (2022-12-06T14:15:17Z) - Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - Learning-by-Novel-View-Synthesis for Full-Face Appearance-based 3D Gaze Estimation [8.929311633814411]
This work examines a novel approach for synthesizing gaze estimation training data based on monocular 3D face reconstruction.
Unlike prior works using multi-view reconstruction, photo-realistic CG models, or generative neural networks, our approach can manipulate and extend the head pose range of existing training data.
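A key detail of such view synthesis is that gaze labels must be transformed consistently with the synthesized view: rotating the virtual camera by R rotates the gaze vector as g' = Rg. A small NumPy sketch, assuming both vectors are expressed in camera coordinates:

```python
import numpy as np

def rotate_gaze_label(g: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Transform a unit gaze vector into the synthesized camera's frame."""
    return R @ g

theta = np.deg2rad(15.0)                       # rotate the view 15 degrees about y
R_y = np.array([[np.cos(theta), 0, np.sin(theta)],
                [0, 1, 0],
                [-np.sin(theta), 0, np.cos(theta)]])
g = np.array([0.0, 0.0, -1.0])                 # looking straight at the camera
print(rotate_gaze_label(g, R_y))               # gaze label for the synthesized view
```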
arXiv Detail & Related papers (2022-01-20T00:29:45Z) - Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel strategy to select a subset of the pre-training data that helps improve generalization on the target task.
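As a rough illustration of data selection (not the strategy the paper derives), one simple heuristic scores each pre-training sample by feature-space proximity to the target data and keeps the closest fraction:

```python
import torch

def select_pretrain_subset(pretrain_feats: torch.Tensor,
                           target_feats: torch.Tensor,
                           keep_ratio: float = 0.3) -> torch.Tensor:
    """Return indices of the pre-training samples closest to the target mean."""
    target_mean = target_feats.mean(dim=0, keepdim=True)         # (1, D)
    dists = torch.cdist(pretrain_feats, target_mean).squeeze(1)  # (N,)
    k = max(1, int(keep_ratio * pretrain_feats.size(0)))
    return torch.topk(dists, k, largest=False).indices           # smallest distances

pretrain_feats = torch.randn(1000, 128)  # dummy embeddings of pre-training data
target_feats = torch.randn(50, 128)      # dummy embeddings of target data
subset_idx = select_pretrain_subset(pretrain_feats, target_feats)
```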
arXiv Detail & Related papers (2021-11-24T06:18:32Z) - Weakly-Supervised Physically Unconstrained Gaze Estimation [80.66438763587904]
We tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions.
We propose a training algorithm along with several novel loss functions especially designed for the task.
We show significant improvements in (a) the accuracy of semi-supervised gaze estimation and (b) cross-domain generalization on the state-of-the-art physically unconstrained in-the-wild Gaze360 gaze estimation benchmark.
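One plausible form of the weak supervision available in interaction videos: when two people look at each other, each predicted gaze should align with the direction towards the other person's head. The loss below sketches that geometric constraint; the paper's actual loss functions are more elaborate.

```python
import torch
import torch.nn.functional as F

def laeo_alignment_loss(gaze_a: torch.Tensor, head_a: torch.Tensor,
                        head_b: torch.Tensor) -> torch.Tensor:
    """Penalize misalignment between A's gaze and the A-to-B head direction."""
    to_b = F.normalize(head_b - head_a, dim=1)    # unit vector from A's head to B's
    cos = F.cosine_similarity(gaze_a, to_b, dim=1)
    return (1.0 - cos).mean()                     # zero when perfectly aligned

gaze_a = F.normalize(torch.randn(4, 3), dim=1)    # predicted gaze of person A
head_a = torch.randn(4, 3)                        # 3D head positions (dummy)
head_b = torch.randn(4, 3)
print(laeo_alignment_loss(gaze_a, head_a, head_b))
```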
arXiv Detail & Related papers (2021-05-20T14:58:52Z) - Visual Distant Supervision for Scene Graph Generation [66.10579690929623]
Scene graph models usually require supervised learning on large quantities of labeled data with intensive human annotation.
We propose visual distant supervision, a novel paradigm of visual relation learning, which can train scene graph models without any human-labeled data.
Comprehensive experimental results show that our distantly supervised model outperforms strong weakly supervised and semi-supervised baselines.
arXiv Detail & Related papers (2021-03-29T06:35:24Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To keep training on the enlarged dataset tractable, we apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z) - Learning to Detect Head Movement in Unconstrained Remote Gaze Estimation in the Wild [19.829721663742124]
We propose end-to-end appearance-based gaze estimation methods that robustly incorporate different levels of head-pose representation into gaze estimation.
Our method generalizes to real-world scenarios with low image quality, varied lighting, and settings where direct head-pose information is not available.
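A minimal sketch of one way to incorporate a head-pose representation: concatenate it with appearance features before the gaze regressor. Architecture and names are illustrative, not the paper's design.

```python
import torch
import torch.nn as nn

class PoseAwareGazeNet(nn.Module):
    def __init__(self, feat_dim: int = 128, pose_dim: int = 3):
        super().__init__()
        # Toy appearance backbone; a real model would use a CNN or ViT.
        self.backbone = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                      nn.Linear(3 * 8 * 8, feat_dim))
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim + pose_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, image: torch.Tensor, head_pose: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image)
        return self.regressor(torch.cat([feats, head_pose], dim=1))  # (pitch, yaw)

net = PoseAwareGazeNet()
pred = net(torch.randn(2, 3, 224, 224), torch.randn(2, 3))  # dummy crop + pose angles
```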
arXiv Detail & Related papers (2020-04-07T22:38:49Z)