Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations
- URL: http://arxiv.org/abs/2506.02764v1
- Date: Tue, 03 Jun 2025 11:29:11 GMT
- Title: Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations
- Authors: Fatma Youssef Mohammed, Kostas Alexis
- Abstract summary: We show that free-viewing and visual search can efficiently share a common representation. This transfer reduces computational costs by 92.29% in terms of GFLOPs and 31.23% in terms of trainable parameters.
- Score: 10.982521876026281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computational human attention modeling in free-viewing and task-specific settings is often studied separately, with limited exploration of whether a common representation exists between them. This work investigates this question and proposes a neural network architecture that builds upon the Human Attention transformer (HAT) to test the hypothesis. Our results demonstrate that free-viewing and visual search can efficiently share a common representation, allowing a model trained on free-viewing attention to transfer its knowledge to task-driven visual search with a performance drop of only 3.86% in the predicted fixation scanpaths, measured by the semantic sequence score (SemSS) metric, which reflects the similarity between predicted and human scanpaths. This transfer reduces computational costs by 92.29% in terms of GFLOPs and 31.23% in terms of trainable parameters.
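To make the transfer recipe concrete, here is a minimal PyTorch sketch of the idea as described in the abstract: a backbone trained for free-viewing is frozen and reused for visual search, so only a small task head remains trainable. All module names and sizes below are illustrative assumptions, not the authors' HAT architecture.

```python
# Minimal sketch of shared-representation transfer (assumptions, not the
# authors' HAT code): a backbone trained on free-viewing is frozen and
# reused for visual search, so only the search head adds trainable cost.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Stand-in for the shared attention representation (hypothetical sizes)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.encoder(image)  # (B, dim, H/4, W/4) feature map

class FixationHead(nn.Module):
    """Predicts a per-location fixation-priority map from shared features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)

backbone = SharedBackbone()   # pretend this was trained on free-viewing
search_head = FixationHead()  # new head for task-driven visual search

# Freeze the shared representation; only the head is trained for search,
# which is where reported parameter/GFLOP savings would come from.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in search_head.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```

The design choice this illustrates: the expensive encoder is shared across both attention settings, so adapting to the search task touches only a small fraction of the parameters.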
Related papers
- Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention [49.99728312519117]
SemBA-FAST is a top-down framework designed for predicting human visual attention in target-present visual search. We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models. These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling.
arXiv Detail & Related papers (2025-07-24T15:19:23Z)
- Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training [102.82553402539139]
Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. These models often face challenges in maintaining consistency across novel and reference views. We propose to use epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning.
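As a toy illustration of the epipolar mechanism (our own sketch under standard multi-view geometry assumptions, not the paper's implementation): a pixel x in the reference view maps to the epipolar line l' = Fx in the target view, which restricts where overlapping information needs to be retrieved.

```python
# Toy sketch of epipolar retrieval (our assumption of the mechanism, not the
# paper's code): a pixel x in the reference view constrains its match in the
# target view to the epipolar line l' = F @ x.
import numpy as np

def epipolar_line(F: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return line coefficients (a, b, c) with a*u + b*v + c = 0 in the target view."""
    l = F @ np.append(x, 1.0)         # homogeneous reference pixel -> target line
    return l / np.linalg.norm(l[:2])  # normalize so |(a, b)| = 1

def sample_along_line(l: np.ndarray, width: int, n: int = 32) -> np.ndarray:
    """Sample n (u, v) points on the line across the target image width."""
    a, b, c = l
    u = np.linspace(0, width - 1, n)
    v = -(a * u + c) / b              # assumes the line is not vertical (b != 0)
    return np.stack([u, v], axis=1)

F = np.random.randn(3, 3)             # placeholder fundamental matrix
pts = sample_along_line(epipolar_line(F, np.array([120.0, 80.0])), width=640)
print(pts.shape)                      # (32, 2): candidate locations to attend to
```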
arXiv Detail & Related papers (2025-02-25T14:04:22Z)
- L-WISE: Boosting Human Visual Category Learning Through Model-Based Image Selection and Enhancement [12.524893323311108]
We show that image perturbations can enhance the ability of humans to accurately report the ground truth class. We propose to augment visual learning in humans in a way that improves human categorization accuracy at test time.
arXiv Detail & Related papers (2024-12-12T23:57:01Z)
- Automatic Discovery of Visual Circuits [66.99553804855931]
We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept.
We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
arXiv Detail & Related papers (2024-04-22T17:00:57Z)
- Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
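A minimal NumPy sketch of this alternation as we read it (the soft update rule, cluster count, and iteration count are our assumptions, not the paper's):

```python
# Minimal sketch of the FEC-style alternation (our reading of the abstract;
# the update rule and cluster count are illustrative assumptions).
import numpy as np

def fec_step(feats: np.ndarray, k: int = 8, iters: int = 3, alpha: float = 0.5):
    """feats: (N, D) pixel features. Alternates k-means-style grouping with
    updating each pixel feature toward its cluster representative."""
    centers = feats[np.random.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # 1) group pixels: assign each pixel to its nearest representative
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # 2) abstract representatives: recompute cluster means
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(axis=0)
        # 3) update pixel features with their current representative
        feats = (1 - alpha) * feats + alpha * centers[assign]
    return feats, centers

pixels = np.random.rand(1024, 64)    # e.g. a 32x32 feature map, D=64
updated, reps = fec_step(pixels)
print(updated.shape, reps.shape)     # (1024, 64) (8, 64)
```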
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors [2.524526956420465]
CapMIT1003 is a database of captions and click-contingent image explorations collected during captioning tasks.
NevaClip is a novel zero-shot method for predicting visual scanpaths.
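In that spirit, a hedged zero-shot sketch (our illustration, not NevaClip's algorithm) can greedily fixate the image crop whose CLIP similarity to the caption is highest, with inhibition of return; the grid policy and model checkpoint below are assumptions.

```python
# Rough zero-shot scanpath sketch in the spirit of CLIP-based prediction
# (our illustration, not NevaClip itself): greedily fixate the image crop
# whose CLIP embedding best matches the caption, with inhibition of return.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def predict_scanpath(image: Image.Image, caption: str, steps: int = 5, grid: int = 4):
    w, h = image.size
    cells = [(i, j) for i in range(grid) for j in range(grid)]
    crops = [image.crop((j*w//grid, i*h//grid, (j+1)*w//grid, (i+1)*h//grid))
             for i, j in cells]
    inputs = proc(text=[caption], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text[0]  # caption-crop similarity
    path, visited = [], set()
    for _ in range(steps):
        best = max((c for c in range(len(cells)) if c not in visited),
                   key=lambda c: sims[c].item())
        visited.add(best)                          # inhibition of return
        i, j = cells[best]
        path.append(((j + 0.5) * w / grid, (i + 0.5) * h / grid))  # cell center
    return path

img = Image.new("RGB", (640, 480), "gray")         # placeholder image
print(predict_scanpath(img, "a dog playing with a ball"))
```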
arXiv Detail & Related papers (2023-05-21T07:24:50Z)
- Semantic Prompt for Few-Shot Image Recognition [76.68959583129335]
We propose a novel Semantic Prompt (SP) approach for few-shot learning.
The proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.
arXiv Detail & Related papers (2023-03-24T16:32:19Z)
- A Graph-Enhanced Click Model for Web Search [67.27218481132185]
We propose a novel graph-enhanced click model (GraphCM) for web search.
We exploit both intra-session and inter-session information for the sparsity and cold-start problems.
arXiv Detail & Related papers (2022-06-17T08:32:43Z)
- Non-local Graph Convolutional Network for joint Activity Recognition and Motion Prediction [2.580765958706854]
3D skeleton-based motion prediction and activity recognition are two interwoven tasks in human behaviour analysis.
We propose a new way to combine the advantages of both graph convolutional neural networks and recurrent neural networks for joint human motion prediction and activity recognition.
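One simple way to wire such a combination (our illustration; the layer sizes, placeholder adjacency, and GRU choice are assumptions, not the paper's architecture) is a per-frame graph convolution over skeleton joints followed by a recurrent layer feeding two task heads:

```python
# Compact sketch of a GCN + RNN combination for joint activity recognition
# and motion prediction (our illustration; sizes and wiring are assumptions).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution over skeleton joints: X' = ReLU(A X W)."""
    def __init__(self, in_dim, out_dim, adj: torch.Tensor):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.register_buffer("adj", adj)            # (J, J) joint adjacency

    def forward(self, x):                           # x: (B, T, J, in_dim)
        return torch.relu(self.adj @ self.lin(x))   # mix features over joints

class JointModel(nn.Module):
    def __init__(self, joints=25, coords=3, hidden=128, classes=10):
        super().__init__()
        adj = torch.eye(joints)                     # placeholder skeleton graph
        self.gcn = GCNLayer(coords, hidden, adj)
        self.gru = nn.GRU(joints * hidden, hidden, batch_first=True)
        self.cls_head = nn.Linear(hidden, classes)  # activity recognition
        self.motion_head = nn.Linear(hidden, joints * coords)  # next pose

    def forward(self, x):                           # x: (B, T, J, coords)
        b, t, j, _ = x.shape
        h = self.gcn(x).reshape(b, t, -1)           # per-frame joint mixing
        seq, _ = self.gru(h)                        # temporal modeling
        last = seq[:, -1]
        return self.cls_head(last), self.motion_head(last).reshape(b, j, -1)

model = JointModel()
logits, next_pose = model(torch.randn(2, 30, 25, 3))
print(logits.shape, next_pose.shape)                # (2, 10) (2, 25, 3)
```

The two heads sharing one recurrent state is what makes the tasks "interwoven": recognition and prediction read the same spatio-temporal representation.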
arXiv Detail & Related papers (2021-08-03T14:07:10Z)
- Region Comparison Network for Interpretable Few-shot Image Classification [97.97902360117368]
Few-shot image classification has been proposed to effectively use only a limited number of labeled examples to train models for new classes.
We propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works.
We also present a new way to generalize the interpretability from the level of tasks to categories.
arXiv Detail & Related papers (2020-09-08T07:29:05Z)
- A Meta-Bayesian Model of Intentional Visual Search [0.0]
We propose a computational model of visual search that incorporates Bayesian interpretations of the neural mechanisms that underlie categorical perception and saccade planning.
To enable meaningful comparisons between simulated and human behaviours, we employ a gaze-contingent paradigm that requires participants to classify occluded MNIST digits through a window that follows their gaze.
Our model is able to recapitulate human behavioural metrics such as classification accuracy while retaining a high degree of interpretability, which we demonstrate by recovering subject-specific parameters from observed human behaviour.
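A toy sketch of the two Bayesian ingredients (our illustration, not the authors' generative model): the class posterior is updated by Bayes' rule from local observations, and the next saccade targets the location with the highest expected information gain. The likelihoods below are random placeholders.

```python
# Toy sketch of Bayesian categorical perception + saccade planning (our
# illustration, not the authors' model): update a posterior over digit
# classes, then pick the fixation that maximizes expected entropy reduction.
import numpy as np

def update_posterior(prior, likelihoods):
    """Bayes rule: p(c | o) is proportional to p(o | c) p(c)."""
    post = prior * likelihoods
    return post / post.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

def pick_saccade(prior, likelihood_per_location):
    """Choose the location whose expected observation most reduces entropy."""
    scores = []
    for lik in likelihood_per_location:  # lik: (n_classes, n_outcomes), rows sum to 1
        p_outcome = prior @ lik          # predictive probability of each outcome
        exp_H = sum(p_outcome[o] * entropy(update_posterior(prior, lik[:, o]))
                    for o in range(lik.shape[1]))
        scores.append(entropy(prior) - exp_H)  # expected information gain
    return int(np.argmax(scores))

prior = np.full(10, 0.1)                               # uniform belief over 10 digits
liks = np.random.dirichlet(np.ones(2), size=(5, 10))   # 5 candidate locations
print(pick_saccade(prior, liks))
```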
arXiv Detail & Related papers (2020-06-05T16:10:35Z)