Voila-A: Aligning Vision-Language Models with User's Gaze Attention
- URL: http://arxiv.org/abs/2401.09454v1
- Date: Fri, 22 Dec 2023 17:34:01 GMT
- Title: Voila-A: Aligning Vision-Language Models with User's Gaze Attention
- Authors: Kun Yan, Lei Ji, Zeyu Wang, Yuntao Wang, Nan Duan, Shuai Ma
- Abstract summary: We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
- Score: 56.755993500556734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the integration of vision and language understanding has led
to significant advancements in artificial intelligence, particularly through
Vision-Language Models (VLMs). However, existing VLMs face challenges in
handling real-world applications with complex scenes and multiple objects, as
well as aligning their focus with the diverse attention patterns of human
users. In this paper, we introduce gaze information, feasibly collected by AR
or VR devices, as a proxy for human attention to guide VLMs and propose a novel
approach, Voila-A, for gaze alignment to enhance the interpretability and
effectiveness of these models in real-world applications. First, we collect
hundreds of minutes of gaze data to demonstrate that we can mimic human gaze
modalities using localized narratives. We then design an automatic data
annotation pipeline utilizing GPT-4 to generate the VOILA-COCO dataset.
Additionally, we innovate the Voila Perceiver modules to integrate gaze
information into VLMs while preserving their pretrained knowledge. We evaluate
Voila-A using a hold-out validation set and a newly collected VOILA-GAZE
Testset, which features real-life scenarios captured with a gaze-tracking
device. Our experimental results demonstrate that Voila-A significantly
outperforms several baseline models. By aligning model attention with human
gaze patterns, Voila-A paves the way for more intuitive, user-centric VLMs and
fosters engaging human-AI interaction across a wide range of applications.
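The abstract notes that localized narratives can stand in for real gaze data, but gives no detail on how a narrative trace becomes a gaze signal. A minimal sketch of one plausible conversion, assuming normalised (x, y) trace points and a ViT-style 16x16 patch grid (neither of which is specified in the abstract), could look like this:

```python
# Illustrative sketch only: render a localized-narrative mouse trace (or real
# fixations) as a gaze-density heatmap over a patch grid. The grid size and
# Gaussian width are assumptions, not values from the paper.
import torch


def gaze_heatmap(points: torch.Tensor, grid: int = 16, sigma: float = 1.5) -> torch.Tensor:
    """points: (N, 2) trace/fixation coordinates in [0, 1]^2.
    Returns a (grid, grid) heatmap normalised to [0, 1]."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid), torch.linspace(0, 1, grid), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1)                      # (grid, grid, 2)
    d2 = ((coords[None] - points[:, None, None]) ** 2).sum(-1)  # (N, grid, grid)
    heat = torch.exp(-d2 / (2 * (sigma / grid) ** 2)).sum(0)    # sum of Gaussians
    return heat / heat.max().clamp(min=1e-6)
```

Flattened to one weight per image patch, such a heatmap is the kind of signal the gaze-conditioned module sketched next could consume.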
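The Voila Perceiver modules are likewise described only as integrating gaze while preserving pretrained knowledge. One hypothetical reading of that constraint is a perceiver-style resampler whose patch features are re-weighted by the gaze heatmap through a zero-initialised scale, so that before any training the module behaves exactly like its gaze-free counterpart. This is a sketch of the idea, not the authors' implementation; all names, dimensions, and the gating mechanism are assumptions.

```python
# Hypothetical gaze-conditioned resampler, loosely inspired by the abstract's
# description of the Voila Perceiver. Dimensions and gating are illustrative.
import torch
import torch.nn as nn


class GazeBiasedResampler(nn.Module):
    def __init__(self, dim: int = 768, n_latents: int = 64, n_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Zero-initialised scale: before training, gaze has no effect, so the
        # pretrained behaviour of the surrounding VLM is left untouched.
        self.gaze_scale = nn.Parameter(torch.zeros(1))

    def forward(self, patches: torch.Tensor, heat: torch.Tensor) -> torch.Tensor:
        # patches: (B, P, dim) patch features from a frozen vision encoder
        # heat:    (B, P) per-patch gaze weights in [0, 1] (flattened heatmap)
        q = self.latents.unsqueeze(0).expand(patches.size(0), -1, -1)
        gated = patches * (1 + self.gaze_scale * heat.unsqueeze(-1))
        out, _ = self.attn(q, gated, gated)
        return out  # (B, n_latents, dim) visual tokens for the language model
```

For example, heat = gaze_heatmap(trace).flatten()[None] on a 16x16 grid pairs with 256 patch tokens of width 768, and the resampled latents would stand in for the usual visual tokens fed to the language model.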
Related papers
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over a Large Multi-modal Model's (LMM) vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of visual-language models to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms previous methods in various zero-shot and fully-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
- Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z)
- Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can be used for guiding exploration by acting as a goal-sampling distribution, during visual goal-conditioned policy learning in robotic manipulation.
arXiv Detail & Related papers (2023-05-28T17:53:09Z)
- In the Eye of the Beholder: Gaze and Actions in First Person Video [30.54510882243602]
We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a headworn camera.
Our dataset comes with videos, gaze tracking data, hand masks and action annotations.
We propose a novel deep model for joint gaze estimation and action recognition in First Person Vision.
arXiv Detail & Related papers (2020-05-31T22:06:06Z)