G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios
- URL: http://arxiv.org/abs/2405.07652v1
- Date: Mon, 13 May 2024 11:24:53 GMT
- Title: G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios
- Authors: Zeyu Wang, Yuanchun Shi, Yuntao Wang, Yuchen Yao, Kun Yan, Yuhan Wang, Lei Ji, Xuhai Xu, Chun Yu
- Abstract summary: This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA.
G-VOILA synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process.
- Score: 36.5550753978585
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modern information querying systems are progressively incorporating multimodal inputs like vision and audio. However, the integration of gaze -- a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables -- remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process. In a user-enactment study involving 21 participants in 3 daily scenarios (p = 21, scene = 3), we revealed ambiguity in users' query language and a gaze-voice coordination pattern in users' natural query behaviors with G-VOILA. Based on the quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm, which effectively integrates gaze data with the in-situ querying context. We then implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (p = 16, scene = 2) demonstrated its effectiveness, achieving both higher objective and subjective scores than a baseline without gaze data. We further conducted interviews and provided insights for future gaze-facilitated information querying systems.
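To make the paradigm concrete, below is a minimal Python sketch of gaze-conditioned query answering: the user's recent fixations pick out a region of the visual field, and a deictic voice query ("what is this?") is answered against that region. The recency weighting, crop size, prompt wording, and the injected `vlm` callable are illustrative assumptions, not G-VOILA's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class GazeSample:
    x: float  # normalized [0, 1] horizontal gaze position in the visual field
    y: float  # normalized [0, 1] vertical gaze position
    t: float  # timestamp in seconds

def fixation_centroid(samples: List[GazeSample], window_s: float = 2.0) -> Tuple[float, float]:
    """Average the gaze samples in the window preceding the voice query.

    Linear recency weighting is a guess at how one might bias toward the
    most recent fixations; the paper's aggregation may differ.
    """
    t_end = max(s.t for s in samples)
    recent = [s for s in samples if t_end - s.t <= window_s]
    weights = [1.0 + (s.t - (t_end - window_s)) for s in recent]
    wsum = sum(weights)
    cx = sum(w * s.x for w, s in zip(weights, recent)) / wsum
    cy = sum(w * s.y for w, s in zip(weights, recent)) / wsum
    return cx, cy

def gaze_crop(cx: float, cy: float, half: float = 0.15) -> Tuple[float, float, float, float]:
    """Normalized crop box around the fixation centroid, clamped to the frame."""
    return (max(0.0, cx - half), max(0.0, cy - half),
            min(1.0, cx + half), min(1.0, cy + half))

def answer_query(voice_query: str, samples: List[GazeSample],
                 vlm: Callable[[str, tuple], str]) -> str:
    """Resolve deictic words ("this", "that") against the gazed-at region."""
    box = gaze_crop(*fixation_centroid(samples))
    prompt = (f"The user is looking at region {box} of their visual field. "
              f"Answer their question about that region: {voice_query}")
    return vlm(prompt, box)

# Demo with synthetic gaze drifting toward the upper right of the frame.
samples = [GazeSample(0.5 + 0.02 * i, 0.5 - 0.015 * i, 0.1 * i) for i in range(20)]
print(answer_query("what is this?", samples, lambda p, b: f"[VLM called with crop {b}]"))
```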
Related papers
- Seeing Through Words: Controlling Visual Retrieval Quality with Language Models [68.49490036960559]
We propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms. Our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries.
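A minimal sketch of the query-completion idea, assuming a generic text-in/text-out `generate` callable; the prompt template is invented for illustration and is not the paper's.

```python
from typing import Callable

COMPLETION_PROMPT = """Rewrite the short image-search query below into a
detailed description of the ideal matching image. Mention subject, setting,
and desired image quality (sharp focus, good lighting, high resolution).

Query: {query}
Detailed description:"""

def complete_query(query: str, generate: Callable[[str], str]) -> str:
    """Expand an underspecified query before embedding it for retrieval.

    `generate` is any language-model call; the completed query, not the raw
    one, would then be embedded and matched against image embeddings.
    """
    return generate(COMPLETION_PROMPT.format(query=query)).strip()

# Stubbed model call for demonstration.
print(complete_query("dog", lambda p: "A sharp, well-lit photo of a golden retriever in a park"))
```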
arXiv Detail & Related papers (2026-02-24T18:20:57Z)
- GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding [5.94301570835109]
This paper introduces GazeVLM, a novel Vision-Language Model (VLM) for multi-task gaze understanding in images. It addresses person detection, gaze target detection, and gaze object identification. GazeVLM represents, to our knowledge, the first application of a VLM to these combined tasks, allowing for selective execution of each task.
arXiv Detail & Related papers (2025-11-09T12:07:40Z)
- Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm [36.752693539572086]
We introduce GLARIFY, a novel method to leverage gaze information to enhance the model's effectiveness in real-world applications. We analyzed hundreds of samples with the gaze modality to demonstrate the noisy nature of users' gaze patterns. Experiments demonstrate that GLARIFY significantly outperforms baselines.
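Gaze traces from wearables are noisy (blinks, tracker jitter), which is the problem this line of work addresses. As a hedged illustration of the preprocessing side only, the sketch below median-filters a trace; GLARIFY's own handling of noisy gaze differs and is more involved.

```python
from statistics import median

def smooth_gaze(points, k=5):
    """Median-filter a gaze trace to suppress blinks and tracker jitter.

    points: list of (x, y) samples; k: odd window size. Median filtering is
    a standard denoiser chosen here purely for illustration.
    """
    half = k // 2
    out = []
    for i in range(len(points)):
        win = points[max(0, i - half): i + half + 1]
        out.append((median(p[0] for p in win), median(p[1] for p in win)))
    return out

# A trace with one blink-induced outlier at index 2.
trace = [(0.50, 0.50), (0.51, 0.50), (9.99, 9.99), (0.52, 0.51), (0.52, 0.52)]
print(smooth_gaze(trace))
```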
arXiv Detail & Related papers (2025-09-26T07:02:40Z)
- From Prediction to Explanation: Multimodal, Explainable, and Interactive Deepfake Detection Framework for Non-Expert Users [21.627851460651968]
We present DF-P2E (Deepfake: Prediction to Explanation), a novel framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations.
arXiv Detail & Related papers (2025-08-11T03:55:47Z)
- Creating General User Models from Computer Use [62.91116265732001]
This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture user knowledge and preferences.
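A toy sketch of what a store of confidence-weighted propositions might look like. The `Proposition` structure, the noisy-OR confidence update, and the threshold query are assumptions for illustration; the actual GUM derives propositions from raw observations with a learned model.

```python
from dataclasses import dataclass, field

@dataclass
class Proposition:
    """One piece of inferred user knowledge with a confidence in [0, 1]."""
    text: str
    confidence: float
    evidence: list = field(default_factory=list)  # supporting observation ids

class GeneralUserModel:
    """Toy store for confidence-weighted propositions (illustrative only)."""
    def __init__(self):
        self.props: dict = {}

    def observe(self, text: str, confidence: float, obs_id: str) -> None:
        # Repeated supporting observations raise confidence toward 1 (noisy-OR).
        p = self.props.get(text)
        if p is None:
            self.props[text] = Proposition(text, confidence, [obs_id])
        else:
            p.confidence = 1 - (1 - p.confidence) * (1 - confidence)
            p.evidence.append(obs_id)

    def query(self, min_confidence: float = 0.5):
        return [p for p in self.props.values() if p.confidence >= min_confidence]

gum = GeneralUserModel()
gum.observe("user is preparing a CHI rebuttal", 0.6, "screenshot-114")
gum.observe("user is preparing a CHI rebuttal", 0.6, "screenshot-131")
print([(p.text, round(p.confidence, 2)) for p in gum.query()])
```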
arXiv Detail & Related papers (2025-05-16T04:00:31Z)
- CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering [13.624962763072899]
KGQA systems typically assume user queries are unambiguous, an assumption that rarely holds in real-world applications.
We propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification.
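The sketch below illustrates the interactive-clarification pattern for entity ambiguity: ask the user only when the knowledge graph offers multiple candidates. The `kg_index` layout and the dialogue policy are invented for illustration and are simpler than CLEAR-KGQA's.

```python
def resolve_entity(mention: str, kg_index: dict, ask) -> str:
    """Disambiguate an entity mention, asking a clarifying question only when needed.

    kg_index maps a surface form to (entity_id, description) candidates;
    `ask` poses a question and returns the user's choice as a string index.
    """
    candidates = kg_index.get(mention, [])
    if not candidates:
        raise KeyError(f"unknown entity: {mention}")
    if len(candidates) == 1:
        return candidates[0][0]  # unambiguous: no question asked
    options = ", ".join(f"{i}: {desc}" for i, (_, desc) in enumerate(candidates))
    choice = ask(f"Which '{mention}' do you mean? {options}")
    return candidates[int(choice)][0]

kg_index = {"Michael Jordan": [("Q41421", "basketball player"),
                               ("Q3308285", "machine-learning researcher")]}
# Stubbed user who picks option 1 after seeing the question.
entity = resolve_entity("Michael Jordan", kg_index, ask=lambda q: (print(q), "1")[1])
print(entity)  # Q3308285
```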
arXiv Detail & Related papers (2025-04-13T17:34:35Z)
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method that can effectively leverage the rich knowledge of vision-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- VERA: Generating Visual Explanations of Two-Dimensional Embeddings via Region Annotation [0.0]
Visual Explanations via Region Annotation (VERA) is an automatic embedding-annotation approach that generates visual explanations for any two-dimensional embedding.
VERA produces informative explanations that characterize distinct regions in the embedding space, allowing users to gain an overview of the embedding landscape at a glance.
We illustrate the usage of VERA on a real-world data set and validate the utility of our approach with a comparative user study.
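As a simplified stand-in for VERA's procedure, the sketch below clusters the 2-D coordinates and labels each region with its most enriched original feature (by within-region z-score). The clustering choice and the ranking heuristic are assumptions, not VERA's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def annotate_regions(xy, X, feature_names, k=3):
    """Label regions of a 2-D embedding by their most enriched feature.

    xy: (n, 2) embedding coordinates; X: (n, d) original feature matrix.
    """
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(xy)
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-9
    annotations = {}
    for c in range(k):
        z = (X[labels == c].mean(axis=0) - mu) / sd  # per-feature enrichment
        annotations[c] = feature_names[int(np.argmax(z))]
    return labels, annotations

# Three synthetic clusters, each enriched in a different feature.
rng = np.random.default_rng(0)
xy = np.vstack([rng.normal(loc, 0.3, (50, 2)) for loc in ((0, 0), (4, 0), (2, 4))])
X = np.vstack([rng.normal((5, 0, 0), 1, (50, 3)), rng.normal((0, 5, 0), 1, (50, 3)),
               rng.normal((0, 0, 5), 1, (50, 3))])
_, ann = annotate_regions(xy, X, ["size", "weight", "age"])
print(ann)
```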
arXiv Detail & Related papers (2024-06-07T10:23:03Z)
- Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models [53.337728969143086]
Recommendation systems harness user-item interactions, such as clicks and reviews, to learn user and item representations.
Previous studies improve recommendation accuracy and interpretability by modeling user preferences across various aspects and intents.
We introduce a chain-based prompting approach to uncover semantic aspect-aware interactions.
arXiv Detail & Related papers (2023-12-26T15:44:09Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
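One simple way to picture gaze alignment is to reweight a VLM's patch attention by a Gaussian prior centered on the gaze point, as sketched below. This post-hoc modulation is only an illustration; Voila-A's actual alignment mechanism differs.

```python
import numpy as np

def gaze_prior(grid=7, gaze=(0.7, 0.3), sigma=0.15):
    """Gaussian heatmap over an image-patch grid, centered on the gaze point."""
    ys, xs = np.mgrid[0:grid, 0:grid] / (grid - 1)
    d2 = (xs - gaze[0]) ** 2 + (ys - gaze[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def gaze_weighted_attention(attn, prior):
    """Bias patch attention toward gazed-at patches, then renormalize.

    attn: (n_patches,) attention weights, e.g., from a VLM's cross-attention.
    """
    w = attn * prior.ravel()
    return w / w.sum()

attn = np.full(49, 1 / 49)              # uniform attention over a 7x7 patch grid
biased = gaze_weighted_attention(attn, gaze_prior())
print(biased.argmax())                  # index of the patch nearest the gaze point
```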
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- Knowledge Graph Augmented Network Towards Multiview Representation Learning for Aspect-based Sentiment Analysis [96.53859361560505]
We propose a knowledge graph augmented network (KGAN) to incorporate external knowledge alongside explicit syntactic and contextual information.
KGAN captures the sentiment feature representations from multiple perspectives, i.e., context-, syntax- and knowledge-based.
Experiments on three popular ABSA benchmarks demonstrate the effectiveness and robustness of our KGAN.
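A toy sketch of fusing the three feature views named above (context-, syntax-, and knowledge-based). The softmax-gated pooling here is an invented stand-in for KGAN's learned, hierarchical fusion.

```python
import numpy as np

def fuse_views(context, syntax, knowledge):
    """Combine context-, syntax-, and knowledge-based aspect features.

    Each argument is a (d,) feature vector for the same aspect term.
    """
    views = np.stack([context, syntax, knowledge])        # (3, d)
    scores = views.mean(axis=1)                           # crude per-view salience
    gates = np.exp(scores) / np.exp(scores).sum()         # softmax over views
    return (gates[:, None] * views).sum(axis=0)           # fused (d,) vector

d = 8
rng = np.random.default_rng(1)
fused = fuse_views(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d))
print(fused.shape)  # (8,)
```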
arXiv Detail & Related papers (2022-01-13T08:25:53Z)
- Global-Local Context Network for Person Search [125.51080862575326]
Person search aims to jointly localize and identify a query person from natural, uncropped images.
We exploit rich context information both globally and locally surrounding the target person, which we refer to as scene and group context, respectively.
We propose a unified global-local context network (GLCNet) with the intuitive aim of feature enhancement.
arXiv Detail & Related papers (2021-12-05T07:38:53Z)
- Exploiting Scene Graphs for Human-Object Interaction Detection [81.49184987430333]
Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects.
We propose SG2HOI, a novel method that exploits scene graph (SG) information for the Human-Object Interaction detection task.
Our method, SG2HOI, incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighborhood and transfer them into interactions.
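The sketch below illustrates point (2), relation-aware message passing, on a toy scene graph: each node pools messages built from its neighbors' embeddings and the relation embeddings on the connecting edges. The mean-pooled additive message is an assumption; SG2HOI's module is learned end to end.

```python
import numpy as np

def relation_message_passing(nodes, edges, rel_emb):
    """One round of relation-aware message passing over a scene graph.

    nodes: {name: (d,) embedding}; edges: list of (src, relation, dst);
    rel_emb: {relation: (d,) embedding}.
    """
    out = {}
    for n, h in nodes.items():
        msgs = [nodes[s] + rel_emb[r] for s, r, t in edges if t == n]
        out[n] = h + np.mean(msgs, axis=0) if msgs else h
    return out

d = 4
rng = np.random.default_rng(2)
nodes = {k: rng.normal(size=d) for k in ("person", "cup", "table")}
edges = [("cup", "on", "table"), ("person", "holds", "cup")]
rel_emb = {"on": rng.normal(size=d), "holds": rng.normal(size=d)}
updated = relation_message_passing(nodes, edges, rel_emb)
print(updated["cup"])  # cup embedding enriched by the 'person holds cup' message
```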
arXiv Detail & Related papers (2021-08-19T09:40:50Z)
- A Convolutional Baseline for Person Re-Identification Using Vision and Language Descriptions [24.794592610444514]
In real-world surveillance scenarios, visual information about the queried person is frequently unavailable.
A two-stream deep convolutional neural network framework supervised by a cross-entropy loss is presented. The learnt visual representations are more robust and perform 22% better during retrieval than those of a single-modality system.
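A minimal PyTorch sketch of the two-stream setup: a visual stream and a textual stream project into a shared embedding space, and both are supervised by identity cross-entropy. The tiny backbones, dimensions, and shared classifier head are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoStreamReID(nn.Module):
    """Toy two-stream network for vision + language person re-identification."""
    def __init__(self, vocab=1000, n_ids=100, dim=128):
        super().__init__()
        # Visual stream: a stand-in for the convolutional backbone.
        self.visual = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(16, dim))
        # Textual stream: bag-of-words embedding of the description.
        self.textual = nn.Sequential(nn.EmbeddingBag(vocab, dim), nn.Linear(dim, dim))
        self.classifier = nn.Linear(dim, n_ids)  # shared identity head

    def forward(self, image, tokens):
        v, t = self.visual(image), self.textual(tokens)
        return self.classifier(v), self.classifier(t), v, t

model = TwoStreamReID()
img = torch.randn(4, 3, 64, 32)           # person crops
txt = torch.randint(0, 1000, (4, 12))     # tokenized descriptions
logits_v, logits_t, _, _ = model(img, txt)
ids = torch.randint(0, 100, (4,))
loss = (nn.functional.cross_entropy(logits_v, ids)
        + nn.functional.cross_entropy(logits_t, ids))
print(loss.item())
```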
arXiv Detail & Related papers (2020-02-20T10:12:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.