Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm
- URL: http://arxiv.org/abs/2509.21980v1
- Date: Fri, 26 Sep 2025 07:02:40 GMT
- Title: Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm
- Authors: Zeyu Wang, Baiyu Chen, Kun Yan, Hongjing Piao, Hao Xue, Flora D. Salim, Yuanchun Shi, Yuntao Wang
- Abstract summary: We introduce GLARIFY, a novel method to leverage gaze information to enhance the model's effectiveness in real-world applications. We analyzed hundreds of samples with the gaze modality to demonstrate the noisy nature of users' gaze patterns. Experiments demonstrate that GLARIFY significantly outperforms baselines.
- Score: 36.752693539572086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the rise in popularity of smart glasses, users' attention has been integrated into Vision-Language Models (VLMs) to streamline multi-modal querying in daily scenarios. However, leveraging gaze data to model users' attention may introduce ambiguity challenges: (1) users' verbal questions become ambiguous when they use pronouns or skip context, and (2) humans' gaze patterns can be noisy and exhibit complex spatiotemporal relationships with their spoken questions. Previous works consider only a single image as the visual modality input, failing to capture the dynamic nature of the user's attention. In this work, we introduce GLARIFY, a novel method that leverages spatiotemporal gaze information to enhance the model's effectiveness in real-world applications. We first analyzed hundreds of querying samples with the gaze modality to demonstrate the noisy nature of users' gaze patterns. We then used GPT-4o to design an automatic data synthesis pipeline that generates the GLARIFY-Ambi dataset, which includes a dedicated chain-of-thought (CoT) process for handling noisy gaze patterns. Finally, we designed a heatmap module that incorporates gaze information into cutting-edge VLMs while preserving their pretrained knowledge. We evaluated GLARIFY on a hold-out test set. Experiments demonstrate that GLARIFY significantly outperforms baselines. By robustly aligning VLMs with human attention, GLARIFY paves the way for a usable and intuitive interaction paradigm with a visual assistant.
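The abstract names a heatmap module that injects gaze into the VLM without disturbing its pretrained weights, but does not spell out the design. Below is a minimal PyTorch sketch of one plausible reading, assuming fixations are rendered into a Gaussian heatmap, pooled to the vision encoder's patch grid, and added back through a small trainable projection while the backbone stays frozen; the module and parameter names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaze_to_heatmap(fixations, h=224, w=224, sigma=0.05):
    """Render normalized (x, y) fixations into an (h, w) Gaussian saliency map."""
    ys = torch.linspace(0, 1, h).view(h, 1)
    xs = torch.linspace(0, 1, w).view(1, w)
    heat = torch.zeros(h, w)
    for x, y in fixations:  # fixation coordinates in [0, 1]
        heat += torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heat / heat.max().clamp(min=1e-6)

class GazeHeatmapAdapter(nn.Module):
    """Illustrative adapter fusing a gaze heatmap with frozen VLM patch features."""

    def __init__(self, feat_dim: int, grid: int = 24):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(feat_dim + 1, feat_dim)  # only this layer is trained

    def forward(self, patch_feats: torch.Tensor, heatmap: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, grid*grid, feat_dim) from the frozen vision encoder
        # heatmap:     (B, H, W) gaze saliency at image resolution
        pooled = F.adaptive_avg_pool2d(heatmap.unsqueeze(1), self.grid)  # (B, 1, g, g)
        pooled = pooled.flatten(2).transpose(1, 2)                       # (B, g*g, 1)
        fused = torch.cat([patch_feats, pooled], dim=-1)
        # Residual update: the gaze cue is added as a learned offset on top of the
        # frozen features, one way to read "preserving pretrained knowledge".
        return patch_feats + self.proj(fused)
```

Whether GLARIFY fuses the heatmap at the feature level as above, at the input level, or through cross-attention is not stated in the abstract; the sketch only illustrates the general pattern.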
Related papers
- Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding [7.281396624646809]
Eye gaze offers valuable cues about attention, short-term intent, and future actions.
We propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks.
We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze.
arXiv Detail & Related papers (2025-10-24T11:33:03Z)
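The entry above names a gaze-regularized attention mechanism without giving its form. A common way to implement this kind of regularization is an auxiliary divergence term between the model's attention over image patches and a gaze-derived target distribution; the sketch below assumes that framing, and the function name, the KL choice, and the loss weight are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def gaze_attention_regularizer(attn_weights: torch.Tensor,
                               gaze_heatmap: torch.Tensor,
                               eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between patch-level attention and a gaze-derived target.

    attn_weights: (B, num_patches) attention mass the model places on patches
    gaze_heatmap: (B, num_patches) gaze saliency pooled to the same patch grid
    """
    p_gaze = gaze_heatmap / (gaze_heatmap.sum(dim=-1, keepdim=True) + eps)
    log_p_model = torch.log(attn_weights + eps)
    # KL(p_gaze || p_model): penalizes attention that ignores gazed regions
    return F.kl_div(log_p_model, p_gaze, reduction="batchmean")

# total_loss = task_loss + lambda_gaze * gaze_attention_regularizer(attn, gaze)
```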
- In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting [12.567763863700058]
EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators.
Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions.
Our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues.
arXiv Detail & Related papers (2025-09-09T07:11:56Z)
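Gaze-guided intent prompting is only named in the entry above. One simple instantiation is to serialize fixation times and locations into the text prompt handed to the multimodal LLM; the sketch below takes that reading, and all of its wording, the coordinate format, and the optional `intent_hint` field are assumptions for illustration.

```python
from typing import List, Tuple

def build_gaze_prompt(question: str,
                      fixations: List[Tuple[float, float, float]],
                      intent_hint: str = "") -> str:
    """Serialize gaze cues into a text prompt for a multimodal LLM.

    fixations: (t_seconds, x, y) with x, y normalized to [0, 1].
    """
    gaze_lines = [f"- at {t:.1f}s the user looked at ({x:.2f}, {y:.2f})"
                  for t, x, y in fixations]
    parts = [
        "The user is wearing smart glasses and asking about what they see.",
        "Gaze fixations over the clip (time, normalized image coordinates):",
        *gaze_lines,
    ]
    if intent_hint:
        parts.append(f"Likely intent: {intent_hint}")
    parts.append(f"Question: {question}")
    return "\n".join(parts)

# Example:
# build_gaze_prompt("What brand is this?", [(0.4, 0.62, 0.35), (1.1, 0.60, 0.33)])
```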
- VILOD: A Visual Interactive Labeling Tool for Object Detection [0.0]
This thesis develops and investigates "VILOD: A Visual Interactive Labeling tool for Object Detection".
It enables users to explore data, interpret model states and AL suggestions, and implement diverse sample selection strategies within an iterative HITL workflow for Object Detection.
The study showed that different visually-guided labeling strategies employed within VILOD result in competitive OD performance trajectories.
arXiv Detail & Related papers (2025-08-29T19:27:10Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
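The Flex entry above describes stacking a text-instructed control policy on top of frozen patch-wise VLM features and training it with behavior cloning. A rough sketch of that pattern follows; the class name, pooling choice, and layer sizes are illustrative, not details from the paper.

```python
import torch
import torch.nn as nn

class FrozenVLMPolicy(nn.Module):
    """Illustrative control head over frozen VLM patch features."""

    def __init__(self, feat_dim: int, text_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, patch_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, num_patches, feat_dim) from a frozen VLM vision encoder
        # text_emb:    (B, text_dim) embedding of the language instruction
        pooled = patch_feats.mean(dim=1)  # simple average over patches
        return self.head(torch.cat([pooled, text_emb], dim=-1))  # e.g. a velocity command

# Behavior cloning: minimize MSE between predicted actions and expert demonstrations.
```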
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios [36.5550753978585]
This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA.
G-VOILA synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process.
arXiv Detail & Related papers (2024-05-13T11:24:53Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
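The "compact point-based representation of affordance" in the MOKA entry above suggests that the VLM's answer can be expressed as a handful of 2D keypoints which are then lifted into robot motion. The sketch below is one plausible rendering of that idea; the specific fields (grasp, function, and target points plus waypoints) and the pinhole back-projection are assumptions for illustration, not quoted from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point2D = Tuple[float, float]  # pixel coordinates in the observed image

@dataclass
class PointAffordance:
    """Compact point-based affordance in image space (field names are illustrative)."""
    grasp_point: Point2D                                    # where the gripper grasps the object
    function_point: Point2D                                 # object part that contacts the target
    target_point: Point2D                                   # where the interaction should happen
    waypoints: List[Point2D] = field(default_factory=list)  # intermediate motion points

def lift_to_camera_frame(p: Point2D, depth_m: float,
                         fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Back-project an image point into camera-frame 3D via a pinhole model."""
    u, v = p
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)
```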
- Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models [53.337728969143086]
Recommendation systems harness user-item interactions like clicks and reviews to learn their representations.
Previous studies improve recommendation accuracy and interpretability by modeling user preferences across various aspects and intents.
We introduce a chain-based prompting approach to uncover semantic aspect-aware interactions.
arXiv Detail & Related papers (2023-12-26T15:44:09Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.