Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding
- URL: http://arxiv.org/abs/2510.21356v1
- Date: Fri, 24 Oct 2025 11:33:03 GMT
- Title: Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding
- Authors: Anupam Pani, Yanchao Yang
- Abstract summary: Eye gaze offers valuable cues about attention, short-term intent, and future actions. We propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze.
- Score: 7.281396624646809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Eye gaze offers valuable cues about attention, short-term intent, and future actions, making it a powerful signal for modeling egocentric behavior. In this work, we propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks: fine-grained future event prediction and current activity understanding. Unlike prior approaches that rely solely on visual inputs or use gaze as an auxiliary input signal, our method uses gaze only during training. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze. This design is flexible and modular, allowing it to generalize across multiple VLM architectures that utilize attention. Experimental results show that our approach improves semantic prediction scores by up to 11 for future event prediction and around 7 for current activity understanding, compared to the corresponding baseline models trained without gaze regularization. These results highlight the value of gaze-guided training in improving the accuracy and robustness of egocentric VLMs. Overall, this work establishes a foundation for using human gaze to enhance the predictive capabilities of VLMs in real-world scenarios like assistive robots and human-machine collaboration. Code and additional information are available at: https://github.com/anupampani/Gaze-VLM
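As a concrete illustration of the idea, here is a minimal sketch of a gaze-regularized attention loss, assuming gaze fixations are rasterized into a heatmap over the same patch grid as the model's visual-attention map and compared with a KL term. The tensor shapes, the KL choice, and the loss weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def gaze_attention_loss(attn_weights: torch.Tensor,
                        gaze_heatmap: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between a model's visual-attention map and a
    human gaze heatmap, both defined over the same grid of N patches.

    attn_weights: (B, N) non-negative attention over visual tokens.
    gaze_heatmap: (B, N) rasterized gaze fixations over the same tokens.
    """
    p = gaze_heatmap / (gaze_heatmap.sum(dim=-1, keepdim=True) + eps)
    q = attn_weights / (attn_weights.sum(dim=-1, keepdim=True) + eps)
    # KL(p || q): penalize attention mass missing from gazed regions.
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1).mean()

# Hypothetical training-time use: total = task_loss + lam * gaze term.
B, N = 4, 196                       # e.g. a 14x14 patch grid
attn = torch.rand(B, N)             # stand-in for pooled cross-attention
gaze = torch.rand(B, N)             # stand-in for a gaze heatmap
loss = gaze_attention_loss(attn, gaze)
print(loss.item())
```

Because the regularizer only touches the training loss, gaze data is not needed at inference, which matches the training-only use of gaze described in the abstract.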
Related papers
- ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation [46.30718574969354]
Egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames. We propose ARGaze, which reformulates gaze estimation as sequential prediction. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation.
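Reformulating online gaze estimation as sequential prediction might look like the sketch below, where a causal transformer consumes past frame features and past gaze points and emits the next gaze coordinate. The module name, dimensions, and architecture details are my own illustrative assumptions, not ARGaze's actual design.

```python
import torch
import torch.nn as nn

class AutoregressiveGaze(nn.Module):
    """Predict the gaze point at t+1 from frame features and gaze <= t."""
    def __init__(self, feat_dim: int = 256, d_model: int = 128):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, d_model)
        self.gaze_proj = nn.Linear(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)   # normalized (x, y)

    def forward(self, frame_feats, past_gaze):
        # frame_feats: (B, T, feat_dim); past_gaze: (B, T, 2)
        x = self.frame_proj(frame_feats) + self.gaze_proj(past_gaze)
        T = x.size(1)
        # Causal mask so step t only sees steps <= t (online setting).
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(x, mask=causal)
        return self.head(h)                 # (B, T, 2), one step ahead

model = AutoregressiveGaze()
pred = model(torch.randn(2, 8, 256), torch.rand(2, 8, 2))
print(pred.shape)  # torch.Size([2, 8, 2])
```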
arXiv Detail & Related papers (2026-02-04T23:33:16Z)
- StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos [128.45606644157]
StreamGaze is the first benchmark to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. We develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories. We observe substantial performance gaps between state-of-the-art MLLMs and human performance.
arXiv Detail & Related papers (2025-12-01T14:15:44Z)
- Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs). This paper presents a comprehensive analysis of attention patterns in efficient VLMs. We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
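One way to realize interleaved cross-attention is a gated block inserted between LLM layers in which text tokens re-attend to vision-encoder features, as in the Flamingo-style sketch below. The class name, the tanh gate, and the dimensions are illustrative assumptions rather than AGE-VLM's actual implementation.

```python
import torch
import torch.nn as nn

class InterleavedCrossAttention(nn.Module):
    """A cross-attention block that can sit between LLM layers so text
    tokens re-attend to vision-encoder features."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as identity

    def forward(self, text_hidden, vision_feats):
        # text_hidden: (B, T, d); vision_feats: (B, V, d)
        attended, _ = self.xattn(self.norm(text_hidden),
                                 vision_feats, vision_feats)
        return text_hidden + torch.tanh(self.gate) * attended

block = InterleavedCrossAttention()
out = block(torch.randn(2, 10, 512), torch.randn(2, 49, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

The zero-initialized gate lets the pre-trained LLM behave unchanged at the start of training, a common trick when splicing new layers into a frozen backbone.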
arXiv Detail & Related papers (2025-11-21T21:36:48Z)
- Eyes on Target: Gaze-Aware Object Detection in Egocentric Video [1.3320917259299652]
We propose Eyes on Target, a novel depth-aware and gaze-guided object detection framework for egocentric videos. Our approach injects gaze-derived features into the attention mechanism of a Vision Transformer (ViT), effectively biasing spatial feature selection toward human-attended regions. We validate our method on an egocentric simulator dataset where human visual attention is critical for task assessment.
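A simple way to bias a ViT's spatial feature selection toward human-attended regions is to add a gaze-derived term to the attention logits before the softmax. The sketch below is a hypothetical illustration of that idea; the function name, shapes, and additive log-space bias are my assumptions, not the paper's exact mechanism.

```python
import torch

def gaze_biased_attention(q, k, v, gaze_bias, scale=None):
    """Scaled dot-product attention with an additive gaze bias on the
    logits, nudging key selection toward human-attended patches.

    q, k, v: (B, H, N, d) per-head projections over N patches.
    gaze_bias: (B, N) log-space bias per patch (higher = more gazed).
    """
    d = q.size(-1)
    scale = scale or d ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale      # (B, H, N, N)
    logits = logits + gaze_bias[:, None, None, :]   # bias the key axis
    attn = logits.softmax(dim=-1)
    return attn @ v

B, H, N, d = 2, 4, 196, 32
out = gaze_biased_attention(torch.randn(B, H, N, d),
                            torch.randn(B, H, N, d),
                            torch.randn(B, H, N, d),
                            gaze_bias=torch.rand(B, N))
print(out.shape)  # torch.Size([2, 4, 196, 32])
```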
arXiv Detail & Related papers (2025-11-03T05:21:58Z)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
- In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting [12.567763863700058]
EgoGazeVQA is an egocentric gaze-guided video question answering benchmark. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. Our gaze-guided intent prompting methods significantly enhance performance.
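Gaze-guided prompting can be as simple as serializing fixation coordinates into the question text, as in this hypothetical helper (the function and prompt wording are my own illustration, not the benchmark's actual pipeline):

```python
def gaze_guided_prompt(question: str,
                       fixations: list[tuple[float, float]]) -> str:
    """Fold normalized gaze fixations into the text prompt so the model
    knows which image regions the camera wearer attended to."""
    pts = ", ".join(f"({x:.2f}, {y:.2f})" for x, y in fixations)
    return (
        "The camera wearer's gaze fixated at normalized image "
        f"coordinates: {pts}. Using these attended regions as cues, "
        f"answer the question: {question}"
    )

print(gaze_guided_prompt("What is the user about to do?",
                         [(0.42, 0.55), (0.47, 0.58)]))
```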
arXiv Detail & Related papers (2025-09-09T07:11:56Z)
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
- Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks. They often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison. We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- Imitation Learning with Human Eye Gaze via Multi-Objective Prediction [3.5779268406205618]
We propose Gaze Regularized Imitation Learning (GRIL), a novel context-aware imitation learning architecture.
GRIL learns concurrently from both human demonstrations and eye gaze to solve tasks where visual attention provides important context.
We show that GRIL outperforms several state-of-the-art gaze-based imitation learning algorithms, simultaneously learns to predict human visual attention, and generalizes to scenarios not present in the training data.
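Learning concurrently from demonstrations and gaze suggests a shared encoder with two heads, one imitating expert actions and one predicting gaze, trained under a weighted multi-objective loss. The sketch below illustrates that pattern under assumed names, dimensions, and loss terms; it is not GRIL's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeImitationPolicy(nn.Module):
    """Shared encoder with two heads: one imitates expert actions,
    one predicts the human gaze location as an auxiliary objective."""
    def __init__(self, obs_dim: int = 128, act_dim: int = 4, grid: int = 49):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.action_head = nn.Linear(256, act_dim)
        self.gaze_head = nn.Linear(256, grid)    # logits over a patch grid

    def forward(self, obs):
        z = self.encoder(obs)
        return self.action_head(z), self.gaze_head(z)

def multi_objective_loss(policy, obs, expert_actions, gaze_target, beta=0.5):
    actions, gaze_logits = policy(obs)
    bc = F.mse_loss(actions, expert_actions)           # imitation term
    gaze = F.cross_entropy(gaze_logits, gaze_target)   # gaze-prediction term
    return bc + beta * gaze

policy = GazeImitationPolicy()
loss = multi_objective_loss(policy, torch.randn(8, 128),
                            expert_actions=torch.randn(8, 4),
                            gaze_target=torch.randint(0, 49, (8,)))
print(loss.item())
```

Sharing the encoder is what lets the auxiliary gaze objective shape the representation that the policy head consumes, which is one plausible reading of how visual attention "provides important context" here.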
arXiv Detail & Related papers (2021-02-25T17:13:13Z)