Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
- URL: http://arxiv.org/abs/2406.02774v2
- Date: Thu, 18 Jul 2024 16:59:08 GMT
- Title: Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
- Authors: Qiaomu Miao, Alexandros Graikos, Jingwei Zhang, Sounak Mondal, Minh Hoai, Dimitris Samaras
- Abstract summary: Training gaze following models requires a large number of images with gaze target coordinates annotated by human annotators.
We propose the first semi-supervised method for gaze following by introducing two novel priors to the task.
Our method outperforms simple pseudo-annotation generation baselines on the GazeFollow image dataset.
- Score: 74.30960564603917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training gaze following models requires a large number of images with gaze target coordinates annotated by human annotators, which is a laborious and inherently ambiguous process. We propose the first semi-supervised method for gaze following by introducing two novel priors to the task. We obtain the first prior using a large pretrained Visual Question Answering (VQA) model, where we compute Grad-CAM heatmaps by 'prompting' the VQA model with a gaze following question. These heatmaps can be noisy and not suited for use in training. The need to refine these noisy annotations leads us to incorporate a second prior. We utilize a diffusion model trained on limited human annotations and modify the reverse sampling process to refine the Grad-CAM heatmaps. By tuning the diffusion process, we achieve a trade-off between the human annotation prior and the VQA heatmap prior, which retains the useful VQA prior information while exhibiting similar properties to the training data distribution. Our method outperforms simple pseudo-annotation generation baselines on the GazeFollow image dataset. More importantly, our pseudo-annotation strategy, applied to a widely used supervised gaze following model (VAT), reduces the annotation need by 50%. Our method also performs the best on the VideoAttentionTarget dataset.
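The Grad-CAM step mentioned in the abstract can be illustrated with a minimal sketch. It shows only the generic Grad-CAM combination rule (each feature map is weighted by the spatial mean of its gradient, the weighted maps are summed, then passed through ReLU), not the authors' pipeline. The function name `grad_cam`, the nested-list layout, and the final peak normalization are assumptions made for illustration. In practice the activations and gradients would come from a layer of the VQA model after backpropagating the answer score for the gaze following question.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Combine K feature maps into one heatmap (hypothetical helper).

    activations[k][y][x]: value of feature map k at position (y, x).
    gradients[k][y][x]:   gradient of the target score w.r.t. that value.
    """
    if not activations or len(activations) != len(gradients):
        raise ValueError("need one gradient map per activation map")

    height, width = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]

    for act, grad in zip(activations, gradients):
        # Channel weight: global average of the gradient over all positions.
        weight = sum(sum(row) for row in grad) / (height * width)
        for y in range(height):
            for x in range(width):
                heatmap[y][x] += weight * act[y][x]

    # ReLU keeps only locations that push the target score up.
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]

    # Assumed convenience step: scale so the peak equals 1.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Toy input: two 2x2 feature maps, one with positive and one with negative gradients.
toy_activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
```

In the toy input, the second feature map has a negative mean gradient, so its contribution is removed by the ReLU and only the top-left location stays active. A heatmap of this kind is the noisy first prior that, according to the abstract, the diffusion model then refines.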
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Learning Diffusion Priors from Observations by Expectation Maximization [6.224769485481242]
We present a novel method based on the expectation-maximization algorithm for training diffusion models from incomplete and noisy observations only.
As part of our method, we propose and motivate an improved posterior sampling scheme for unconditional diffusion models.
arXiv Detail & Related papers (2024-05-22T15:04:06Z) - Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation [7.545077734926115]
We propose a simple and novel deep learning model designed to estimate gaze from videos.
Our method employs a spatial attention mechanism that tracks spatial dynamics within videos.
Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings.
arXiv Detail & Related papers (2024-04-08T06:07:32Z) - Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation [16.22199565010318]
We introduce a method that incorporates gradient-guided perturbations to the visual encoder of the multimodality model during both pre-training and fine-tuning phases.
The results show that, even with a significantly smaller pre-training image caption dataset, our approach achieves competitive outcomes.
arXiv Detail & Related papers (2024-03-05T06:57:37Z) - Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition [72.35438297011176]
We propose a novel method to realize seamless adaptation of pre-trained models for visual place recognition (VPR).
Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method.
Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time.
arXiv Detail & Related papers (2024-02-22T12:55:01Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Patch-level Gaze Distribution Prediction for Gaze Following [49.93340533068501]
We introduce the patch distribution prediction (PDP) method for gaze following training.
We show that our model regularizes the MSE loss by predicting better heatmap distributions on images with larger annotation variances.
Experiments show that our model bridges the gap between the target prediction and in/out prediction subtasks, with a significant improvement on both subtasks on public gaze following datasets.
arXiv Detail & Related papers (2022-11-20T19:25:15Z) - Explanation-Guided Training for Cross-Domain Few-Shot Classification [96.12873073444091]
The cross-domain few-shot classification task (CD-FSC) combines few-shot classification with the requirement to generalize across domains represented by datasets.
We introduce a novel training approach for existing FSC models.
We show that explanation-guided training effectively improves the model generalization.
arXiv Detail & Related papers (2020-07-17T07:28:08Z)