CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model
- URL: http://arxiv.org/abs/2403.05124v1
- Date: Fri, 8 Mar 2024 07:37:21 GMT
- Title: CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model
- Authors: Pengwei Yin, Guanzhong Zeng, Jingjing Wang, Di Xie
- Abstract summary: We propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge.
Our framework is the first to leverage a vision-and-language cross-modality approach for the gaze estimation task.
- Score: 13.890404285565225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gaze estimation methods often experience significant performance degradation
when evaluated across different domains, due to the domain gap between the
testing and training data. Existing methods try to address this issue using
various domain generalization approaches, but with little success because of
the limited diversity of gaze datasets, such as appearance, wearable, and image
quality. To overcome these limitations, we propose a novel framework called
CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its
transferable knowledge. Our framework is the first to leverage a
vision-and-language cross-modality approach for the gaze estimation task.
Specifically, we extract gaze-relevant features by pushing them away from
gaze-irrelevant features, which can be flexibly constructed via language
descriptions. To learn more suitable prompts, we propose a personalized context
optimization method for text prompt tuning. Furthermore, we utilize the
relationship among gaze samples to refine the distribution of gaze-relevant
features, thereby improving the generalization capability of the gaze
estimation model. Extensive experiments demonstrate the excellent performance
of CLIP-Gaze over existing methods on four cross-domain evaluations.
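The core mechanism described in the abstract can be made concrete with a short, hedged sketch (written for this summary, not taken from the paper): image-derived gaze features are pushed away from text embeddings that describe gaze-irrelevant factors. The prompt examples, tensor shapes, and loss form below are assumptions chosen for illustration.

```python
# Minimal sketch (not the authors' code): separate a gaze-relevant feature
# from language-defined gaze-irrelevant directions via a cosine push-away loss.
import torch
import torch.nn.functional as F

def push_away_loss(gaze_feat: torch.Tensor, irrelevant_feats: torch.Tensor) -> torch.Tensor:
    """gaze_feat: (B, D) image-derived gaze features.
    irrelevant_feats: (K, D) text embeddings of gaze-irrelevant factors
    (e.g. "a photo of a face wearing glasses", "a blurry face image"),
    assumed to come from a frozen CLIP-style text encoder."""
    gaze_feat = F.normalize(gaze_feat, dim=-1)
    irrelevant_feats = F.normalize(irrelevant_feats, dim=-1)
    # Cosine similarity between each sample and every irrelevant direction (B, K);
    # penalizing its magnitude pushes the gaze feature away from those directions.
    sim = gaze_feat @ irrelevant_feats.t()
    return sim.abs().mean()

# Toy usage with random tensors standing in for encoder outputs.
loss = push_away_loss(torch.randn(8, 512), torch.randn(16, 512))
```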
Related papers
- LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation [12.903711441941663]
The ability of gaze estimation models to generalize is often significantly hindered by various factors unrelated to gaze.
We propose a novel approach, reframing the gaze estimation task as a vision-language alignment issue.
Our proposed framework, named Language-Guided Gaze Estimation (LG-Gaze), learns continuous and geometry-sensitive features for gaze estimation, benefiting from the rich prior knowledge of vision-language models.
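A loose, purely illustrative interpretation of the continuous-prompt idea (the class name, anchor count, and soft-assignment temperature below are invented for the sketch, not taken from LG-Gaze):

```python
# Illustrative sketch only: treat gaze estimation as vision-language alignment
# by interpolating learnable prompt embeddings over a continuous gaze angle.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousGazePrompt(nn.Module):
    """Learnable prompt embeddings for a few anchor yaw angles; a continuous
    yaw is soft-assigned to them, giving a geometry-aware text-side target."""
    def __init__(self, dim: int = 512, n_anchors: int = 8):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(n_anchors, dim))
        self.register_buffer("angles", torch.linspace(-1.0, 1.0, n_anchors))

    def forward(self, yaw: torch.Tensor) -> torch.Tensor:  # yaw: (B,), normalized
        w = F.softmax(-(yaw[:, None] - self.angles[None, :]).abs() / 0.1, dim=-1)
        return F.normalize(w @ self.anchors, dim=-1)        # (B, dim)

def alignment_loss(img_feat: torch.Tensor, yaw: torch.Tensor, prompts: ContinuousGazePrompt):
    """Pull the image feature toward the prompt embedding of its gaze angle."""
    return 1.0 - F.cosine_similarity(F.normalize(img_feat, dim=-1), prompts(yaw)).mean()
```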
arXiv Detail & Related papers (2024-11-13T13:46:15Z)
- Improving Domain Generalization on Gaze Estimation via Branch-out Auxiliary Regularization [3.3539987257923247]
Branch-out Auxiliary Regularization (BAR) is designed to boost gaze estimation's generalization capabilities without requiring direct access to target domain data.
BAR integrates two auxiliary consistency regularization branches: one that uses augmented samples to counteract environmental variations, and another that aligns gaze directions with positive source domain samples to encourage the learning of consistent gaze features.
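A minimal sketch of consistency-style regularization in this spirit (the loss weights, helper names, and exact branch design are placeholders, not the paper's implementation):

```python
# Hedged sketch: predictions should agree across an appearance augmentation,
# and stay close to the prediction on a positive source-domain sample.
import torch
import torch.nn.functional as F

def angular_distance(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """Mean angle (radians) between two batches of 3D gaze vectors."""
    cos = F.cosine_similarity(g1, g2, dim=-1).clamp(-1 + 1e-7, 1 - 1e-7)
    return torch.acos(cos).mean()

def bar_style_loss(model, x, x_aug, x_pos, y):
    pred, pred_aug, pred_pos = model(x), model(x_aug), model(x_pos)
    supervised = angular_distance(pred, y)
    consistency = angular_distance(pred, pred_aug)        # robustness to environment changes
    alignment = angular_distance(pred, pred_pos)          # pull toward a positive source sample
    return supervised + 0.1 * (consistency + alignment)   # weights are placeholders
```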
arXiv Detail & Related papers (2024-05-02T16:26:37Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in the literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- Modeling State Shifting via Local-Global Distillation for Event-Frame Gaze Tracking [61.44701715285463]
This paper tackles the problem of passive gaze estimation using both event and frame data.
We reformulate gaze estimation as the quantification of the state shifting from the current state to several prior registered anchor states.
To improve the generalization ability, instead of learning a large gaze estimation network directly, we align a group of local experts with a student network.
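A rough sketch of how such an anchor-relative formulation could look (the module name, anchor representation, and head design are assumptions for illustration, not the paper's architecture):

```python
# Loose illustration of the "state shifting" idea: predict gaze as a weighted
# combination of registered anchor gaze states plus a learned residual shift.
import torch
import torch.nn as nn

class AnchorShiftHead(nn.Module):
    def __init__(self, feat_dim: int, anchor_gazes: torch.Tensor):
        super().__init__()
        self.register_buffer("anchors", anchor_gazes)      # (K, 2) pitch/yaw anchor states
        k = anchor_gazes.shape[0]
        self.weight_head = nn.Linear(feat_dim, k)           # soft assignment to anchors
        self.shift_head = nn.Linear(feat_dim, 2)             # residual shift from the anchors

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        w = self.weight_head(feat).softmax(dim=-1)           # (B, K)
        base = w @ self.anchors                              # weighted anchor state, (B, 2)
        return base + self.shift_head(feat)                  # shifted gaze estimate
```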
arXiv Detail & Related papers (2024-03-31T03:30:37Z)
- HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain Generalization [69.33162366130887]
Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features.
We introduce a novel method designed to supplement the model with domain-level and task-specific characteristics.
This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting the generalization.
arXiv Detail & Related papers (2024-01-18T04:23:21Z)
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
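A minimal sketch of such a projector, assuming pooled per-level visual features and a CLIP-style text embedding dimension (layer sizes and names are invented for illustration):

```python
# Rough sketch (assumed shapes, not the paper's architecture): a projector that
# maps multi-level visual features into the text-embedding space so the two
# modalities can be compared directly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelProjector(nn.Module):
    def __init__(self, level_dims=(256, 512, 1024), text_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sum(level_dims), text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, feats):  # feats: list of (B, D_i) pooled features, one per level
        fused = torch.cat(feats, dim=-1)
        return F.normalize(self.proj(fused), dim=-1)  # comparable to normalized text features
```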
arXiv Detail & Related papers (2024-01-15T04:04:47Z)
- GazeCLIP: Towards Enhancing Gaze Estimation via Text Guidance [9.639618473371083]
Existing gaze estimation approaches overlook the rich semantic cues conveyed by linguistic signals and the priors embedded in CLIP feature space.
Specifically, we intricately design a linguistic description generator to produce text signals with coarse directional cues.
This is followed by a fine-grained multi-modal fusion module aimed at modeling the relationships between heterogeneous inputs.
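A simplified example of producing coarse directional text cues (the prompt wording, thresholds, and sign conventions are invented; the paper's generator may differ):

```python
# Toy generator of coarse directional text cues from a rough gaze estimate.
def coarse_direction_prompt(pitch: float, yaw: float) -> str:
    # Thresholds and sign conventions are arbitrary choices for this sketch.
    vertical = "up" if pitch > 0.1 else "down" if pitch < -0.1 else "ahead"
    horizontal = "left" if yaw > 0.1 else "right" if yaw < -0.1 else "center"
    return f"a photo of a face gazing {vertical} and to the {horizontal}"

print(coarse_direction_prompt(0.3, -0.2))  # "a photo of a face gazing up and to the right"
```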
arXiv Detail & Related papers (2023-12-30T15:24:50Z)
- Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence [48.659338080020746]
Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions.
We present a novel RIS approach, which substantially improves the generalization ability by addressing the two dilemmas mentioned above.
Specifically, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context.
arXiv Detail & Related papers (2023-12-01T09:31:24Z)
- Consistency Regularization for Generalizable Source-free Domain Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset.
Existing SFDA methods only assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets.
We propose a consistency regularization framework to develop a more generalizable SFDA method.
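A generic consistency-regularization sketch of the kind commonly used in source-free adaptation, shown only to make the idea concrete (a standard weak/strong-view formulation, not necessarily the paper's exact loss):

```python
# Generic consistency regularization on unlabeled target data: predictions on
# a weakly and a strongly augmented view of the same image should agree.
import torch
import torch.nn.functional as F

def consistency_loss(model, x_weak, x_strong):
    with torch.no_grad():
        target = model(x_weak).softmax(dim=-1)       # pseudo-target from the weak view
    log_pred = model(x_strong).log_softmax(dim=-1)   # prediction on the strong view
    return F.kl_div(log_pred, target, reduction="batchmean")
```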
arXiv Detail & Related papers (2023-08-03T07:45:53Z)
- Prompting Diffusion Representations for Cross-Domain Semantic Segmentation [101.04326113360342]
Diffusion pretraining achieves extraordinary domain generalization results for semantic segmentation.
We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head.
arXiv Detail & Related papers (2023-07-05T09:28:25Z)
- Contrastive Representation Learning for Gaze Estimation [8.121462458089143]
We propose a contrastive representation learning framework for gaze estimation, named Gaze Contrastive Learning (GazeCLR).
Our results show that GazeCLR improves the performance of cross-domain gaze estimation and yields up to a 17.2% relative improvement.
The GazeCLR framework is competitive with state-of-the-art representation learning methods for few-shot evaluation.
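A minimal InfoNCE-style sketch in the spirit of GazeCLR (shapes, temperature, and projection dimensions are placeholders, not the released code):

```python
# Two views of the same eye/face image form a positive pair; other samples in
# the batch act as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))       # diagonal entries are the positive pairs
    return F.cross_entropy(logits, labels)

# Toy usage with random projections standing in for the two augmented views.
loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```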
arXiv Detail & Related papers (2022-10-24T17:01:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.