Related papers: ALOHA: from Attention to Likes -- a unified mOdel for understanding HumAn responses to diverse visual content

ALOHA: from Attention to Likes -- a unified mOdel for understanding HumAn responses to diverse visual content

URL: http://arxiv.org/abs/2312.10175v2
Date: Thu, 4 Jul 2024 21:01:48 GMT
Title: ALOHA: from Attention to Likes -- a unified mOdel for understanding HumAn responses to diverse visual content
Authors: Peizhao Li, Junfeng He, Gang Li, Rachit Bhargava, Shaolei Shen, Nachiappan Valliappan, Youwei Liang, Hongxiang Gu, Venky Ramachandran, Golnaz Farhadi, Yang Li, Kai J Kohlhoff, Vidhya Navalpakkam,
Abstract summary: We propose ALOHA -- a unified model for understanding human responses from attention to likes. ALOHA predicts different human responses such as attention heatmaps, scanpath or viewing order, as well as subjective rating/preference. Potential applications include providing instant feedback on the effectiveness of UIs/designs/images, and serving as a reward model to further optimize visual-content creation.
Score: 12.281060227170792
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Progress in human behavior modeling involves understanding both implicit, early-stage perceptual behavior such as human attention and explicit, later-stage behavior such as subjective preferences/likes. Yet, most prior research has focused on modeling implicit and explicit human behavior in isolation; and often limited to a specific type of visual content. Can we build a unified model of human attention and preference behavior that works reliably across diverse types of visual content? Such a model would enable predicting subjective feedback such as satisfaction or aesthetic quality, along with the underlying human attention or interaction heatmaps and viewing order, enabling designers and content-creation models to optimize their creation for human-centric improvements. In this paper, we propose ALOHA -- a unified model for understanding human responses from attention to likes, across diverse visual content. ALOHA leverages a multimodal transformer % featuring distinct prediction heads for each facet, and predicts different human responses such as attention heatmaps, scanpath or viewing order, as well as subjective rating/preference. We train ALOHA on diverse public datasets spanning natural images, webpages and graphic designs, and achieve SOTA performance on multiple benchmarks across different image domains and various behavior modeling tasks. Potential applications include providing instant feedback on the effectiveness of UIs/designs/images, and serving as a reward model to further optimize visual-content creation.

Related papers

SITE: towards Spatial Intelligence Thorough Evaluation [121.1493852562597]
Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships.<n>We introduce SITE, a benchmark dataset towards SI Thorough Evaluation.<n>Our approach to curating the benchmark combines a bottom-up survey about 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science.
arXiv Detail & Related papers (2025-05-08T17:45:44Z)
Testing the limits of fine-tuning to improve reasoning in vision language models [51.58859621164201]
We introduce visual stimuli and human judgments on visual cognition tasks to evaluate performance across cognitive domains. We fine-tune models on ground truth data for intuitive physics and causal reasoning. We find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
arXiv Detail & Related papers (2025-02-21T18:58:30Z)
Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions [0.03495246564946555]
We introduce a novel task called Illusory VQA, along with four specialized datasets: IllusionMNIST, IllusionFashionMNIST, IllusionAnimals, and IllusionChar. These datasets are designed to evaluate the performance of state-of-the-art multimodal models in recognizing and interpreting visual illusions.
arXiv Detail & Related papers (2024-12-11T07:51:18Z)
When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability. We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape. We collect 35K trials of behavioral data from over 500 participants. We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
From Category to Scenery: An End-to-End Framework for Multi-Person Human-Object Interaction Recognition in Videos [9.159660801125812]
Video-based Human-Object Interaction (HOI) recognition explores the intricate dynamics between humans and objects. In this work, we propose a novel end-to-end category to scenery framework, CATS. We construct a scenery interactive graph with these enhanced geometric-visual features as nodes to learn the relationships among human and object categories.
arXiv Detail & Related papers (2024-07-01T02:42:55Z)
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system. We propose a preference-based reinforcement learning method that fine-tunes the vision models to better align the vision models with human aesthetics.
arXiv Detail & Related papers (2024-06-13T17:59:20Z)
Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. We propose to leverage knowledge from proxemic behavior and physics to compensate the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model [69.12623428463573]
AlignDiff is a novel framework to quantify human preferences, covering abstractness, and guide diffusion planning. It can accurately match user-customized behaviors and efficiently switch from one to another. We demonstrate its superior performance on preference matching, switching, and covering compared to other baselines.
arXiv Detail & Related papers (2023-10-03T13:53:08Z)
Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises [7.689542442882423]
We designed a dual-stream vision model inspired by the human brain. This model features retina-like input layers and includes two streams: one determining the next point of focus (the fixation), while the other interprets the visuals surrounding the fixation. We evaluated this model against various benchmarks in terms of object recognition, gaze behavior and adversarial robustness.
arXiv Detail & Related papers (2022-06-15T03:44:42Z)
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets [0.0]
We select publicly available state-of-the-art visual search models in natural scenes and evaluate them on different datasets. We propose an improvement to the Ideal Bayesian Searcher through a combination with a neural network-based visual search model.
arXiv Detail & Related papers (2021-12-10T19:56:45Z)
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks. To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.