RAZE: Region Guided Self-Supervised Gaze Representation Learning
- URL: http://arxiv.org/abs/2208.02485v2
- Date: Fri, 5 Aug 2022 13:02:04 GMT
- Title: RAZE: Region Guided Self-Supervised Gaze Representation Learning
- Authors: Neeru Dubey, Shreya Ghosh, Abhinav Dhall
- Abstract summary: RAZE is a Region guided self-supervised gAZE representation learning framework that leverages non-annotated facial image data.
Ize-Net is a capsule-layer-based CNN architecture that efficiently captures rich eye representations.
- Score: 5.919214040221055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic eye gaze estimation is an important problem in vision-based
assistive technology, with use cases in emerging topics such as augmented
reality, virtual reality, and human-computer interaction. Over the past few
years, there has been increasing interest in unsupervised and self-supervised
learning paradigms, as they overcome the requirement for large-scale annotated
data. In this paper, we propose RAZE, a Region guided self-supervised gAZE
representation learning framework that leverages non-annotated facial image
data. RAZE learns gaze representations via auxiliary supervision, i.e.
pseudo-gaze zone classification, where the objective is to classify the visual
field into different gaze zones (i.e. left, right, and center) by leveraging
the relative position of the pupil centers. Thus, we automatically annotate
pseudo gaze zone labels for 154K web-crawled images and learn feature
representations via the `Ize-Net' framework. `Ize-Net' is a capsule-layer-based
CNN architecture that can efficiently capture rich eye representations. The
discriminative behaviour of the feature representation is evaluated on four
benchmark datasets: CAVE, TabletGaze, MPII, and RT-GENE. Additionally, we
evaluate the generalizability of the proposed network on two downstream tasks
(i.e. driver gaze estimation and visual attention estimation), which
demonstrates the effectiveness of the learnt eye gaze representation.
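The pseudo-labelling step above can be made concrete with a short sketch. The code below is a minimal, hypothetical illustration (not the authors' implementation): it assumes an off-the-shelf landmark detector has already localized pupil centers and eye corners, and the `margin` threshold and function names are assumptions.

```python
# Minimal sketch of pseudo gaze-zone labelling (illustrative only;
# not the paper's code). Assumes a landmark detector has already
# produced pupil centers and eye corners in image coordinates.

def gaze_zone(pupil_x, corner_left_x, corner_right_x, margin=0.15):
    """Map the pupil's relative horizontal position within the eye
    to one of three zones: 'left', 'center', or 'right'."""
    # Normalize the pupil position to [0, 1] across the eye width.
    rel = (pupil_x - corner_left_x) / (corner_right_x - corner_left_x)
    if rel < 0.5 - margin:
        return "left"
    if rel > 0.5 + margin:
        return "right"
    return "center"

def pseudo_label(landmarks):
    """Combine both eyes; fall back to 'center' on disagreement."""
    zones = [gaze_zone(*landmarks["left_eye"]),
             gaze_zone(*landmarks["right_eye"])]
    return zones[0] if zones[0] == zones[1] else "center"
```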
Related papers
- Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
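A rough sketch of the alternating scheme just described (an illustrative simplification, not the FEC code): pixels are assigned to their nearest representative, and their deep features are then rewritten with the cluster means.

```python
import numpy as np

def fec_step(feats, k=8, iters=5):
    """Toy alternation between clustering pixel features and rewriting
    them with cluster representatives (illustration only, not FEC).
    feats: (num_pixels, dim) array of deep per-pixel features."""
    rng = np.random.default_rng(0)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Grouping: assign each pixel to its nearest representative.
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Update: each representative becomes its cluster's mean.
        centers = np.stack([
            feats[assign == c].mean(axis=0) if (assign == c).any()
            else centers[c]
            for c in range(k)
        ])
    # Abstraction: replace every pixel feature with its representative.
    return centers[assign]
```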
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- SeeBel: Seeing is Believing [0.9790236766474201]
We propose three visualizations that enable users to compare dataset statistics and AI performance for segmenting all images.
Our project tries to further increase the interpretability of the trained AI model for segmentation by visualizing its image attention weights.
We propose to conduct surveys on real users to study the efficacy of our visualization tool in the computer vision and AI domains.
arXiv Detail & Related papers (2023-12-18T05:11:00Z)
- Masked Contrastive Graph Representation Learning for Age Estimation [44.96502862249276]
This paper exploits the ability of graph representation learning to handle redundant image information.
We propose a novel Masked Contrastive Graph Representation Learning (MCGRL) method for age estimation.
Experimental results on real-world face image datasets demonstrate the superiority of our proposed method over other state-of-the-art age estimation approaches.
arXiv Detail & Related papers (2023-06-16T15:53:21Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that, by learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependencies.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
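A toy sketch of the local/global split described above (a hypothetical simplification, not the HLG implementation): tokens attend locally inside non-overlapping windows, and one pooled token per window takes part in a second, global attention pass.

```python
import torch
from torch import nn

class LocalGlobalAttention(nn.Module):
    """Toy local-within-window / global-across-window attention
    (an illustrative simplification, not the HLG architecture)."""
    def __init__(self, dim, window=16, heads=4):
        super().__init__()
        self.window = window
        self.local = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.glob = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                 # x: (batch, tokens, dim)
        b, n, d = x.shape
        w = self.window                   # tokens must be divisible by w
        # Local attention inside non-overlapping windows.
        xw = x.reshape(b * n // w, w, d)
        xw, _ = self.local(xw, xw, xw)
        x = xw.reshape(b, n, d)
        # Global attention over one mean-pooled token per window.
        pooled = x.reshape(b, n // w, w, d).mean(dim=2)
        g, _ = self.glob(pooled, pooled, pooled)
        # Broadcast the globally mixed window summaries back to tokens.
        return x + g.repeat_interleave(w, dim=1)
```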
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition.
We propose to incorporate peripheral position encoding into the multi-head self-attention layers to let the network learn to partition the visual field into diverse peripheral regions given training data.
We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
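The peripheral position encoding can be pictured as a learned, distance-dependent bias added to the attention logits, letting heads specialize on central versus peripheral regions. The sketch below is a simplification under those assumptions, not the PerViT code.

```python
import torch
from torch import nn

class PeripheralBias(nn.Module):
    """Toy distance-based attention bias (illustration only): an MLP
    maps the distance between token positions to a per-head additive
    bias on the attention logits."""
    def __init__(self, heads=4, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, heads))

    def forward(self, coords):            # coords: (tokens, 2) positions
        dist = torch.cdist(coords, coords)        # (tokens, tokens)
        bias = self.mlp(dist.unsqueeze(-1))       # (tokens, tokens, heads)
        return bias.permute(2, 0, 1)              # (heads, tokens, tokens)

# Usage sketch: attn_logits = q @ k.transpose(-2, -1) / scale + bias
```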
arXiv Detail & Related papers (2022-06-14T12:47:47Z)
- Gaze Estimation with Eye Region Segmentation and Self-Supervised Multistream Learning [8.422257363944295]
We present a novel multistream network that learns robust eye representations for gaze estimation.
We first create a synthetic dataset containing eye region masks detailing the visible eyeball and iris using a simulator.
We then perform eye region segmentation with a U-Net type model which we later use to generate eye region masks for real-world images.
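A minimal sketch of the mask-generation step (illustrative only; the model interface and threshold are assumptions): the segmentation network trained on synthetic data is run over real eye crops, and its output is thresholded into binary eye-region masks.

```python
import torch

@torch.no_grad()
def make_eye_masks(seg_model, eye_crops, thresh=0.5):
    """eye_crops: (batch, 3, H, W) tensor of real eye-region images.
    seg_model: a segmentation net (e.g. U-Net) emitting per-pixel
    logits, here assumed to output shape (batch, 1, H, W).
    Returns binary masks usable as an extra input stream."""
    logits = seg_model(eye_crops)
    return (logits.sigmoid() > thresh).float()
```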
arXiv Detail & Related papers (2021-12-15T04:44:45Z)
- Region Similarity Representation Learning [94.88055458257081]
Region Similarity Representation Learning (ReSim) is a new approach to self-supervised representation learning for localization-based tasks.
ReSim learns both regional representations for localization and semantic image-level representations.
We show how ReSim learns representations which significantly improve the localization and classification performance compared to a competitive MoCo-v2 baseline.
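The combination of image-level and region-level objectives can be illustrated with a toy InfoNCE loss at both granularities (a sketch only; ReSim's actual region sampling from overlapping crops is more involved).

```python
import torch
import torch.nn.functional as F

def global_plus_region_loss(g1, g2, r1, r2, tau=0.2):
    """g1, g2: (batch, dim) image-level embeddings of two views.
    r1, r2: (batch, regions, dim) embeddings of corresponding regions.
    A toy InfoNCE applied at both granularities (illustration only)."""
    def info_nce(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / tau
        target = torch.arange(len(a), device=a.device)
        return F.cross_entropy(logits, target)
    # Image-level term plus a region-level term over matched regions.
    return info_nce(g1, g2) + info_nce(r1.flatten(0, 1), r2.flatten(0, 1))
```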
arXiv Detail & Related papers (2021-03-24T00:42:37Z)
- Goal-Oriented Gaze Estimation for Zero-Shot Learning [62.52340838817908]
We introduce a novel goal-oriented gaze estimation module (GEM) to improve the discriminative attribute localization.
We aim to predict the actual human gaze location to obtain the visual attention regions for recognizing a novel object guided by its attribute description.
This work implies the promising benefits of collecting human gaze datasets and of applying automatic gaze estimation algorithms to high-level computer vision tasks.
arXiv Detail & Related papers (2021-03-05T02:14:57Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)