Simulating Human Audiovisual Search Behavior
- URL: http://arxiv.org/abs/2602.02790v1
- Date: Mon, 02 Feb 2026 20:47:05 GMT
- Title: Simulating Human Audiovisual Search Behavior
- Authors: Hyunsung Cho, Xuejing Luo, Byungjoo Lee, David Lindlbauer, Antti Oulasvirta,
- Abstract summary: Locating a target based on auditory and visual cues requires balancing effort, time, and accuracy under uncertainty. We present Sensonaut, a computational model of embodied audiovisual search. Our simulation of human-like resource-rational search informs the design of audiovisual interfaces that minimize search cost and cognitive load.
- Score: 35.52238000281623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Locating a target based on auditory and visual cues – such as finding a car in a crowded parking lot or identifying a speaker in a virtual meeting – requires balancing effort, time, and accuracy under uncertainty. Existing models of audiovisual search often treat perception and action in isolation, overlooking how people adaptively coordinate movement and sensory strategies. We present Sensonaut, a computational model of embodied audiovisual search. The core assumption is that people deploy their body and sensory systems in ways they believe will most efficiently improve their chances of locating a target, trading off time and effort under perceptual constraints. Our model formulates this as a resource-rational decision-making problem under partial observability. We validate the model against newly collected human data, showing that it reproduces both adaptive scaling of search time and effort under task complexity, occlusion, and distraction, and characteristic human errors. Our simulation of human-like resource-rational search informs the design of audiovisual interfaces that minimize search cost and cognitive load.
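The abstract frames embodied audiovisual search as resource-rational decision-making under partial observability. As a rough illustration of that general idea only, not Sensonaut's actual formulation, the Python sketch below maintains a Bayesian belief over candidate target locations, offers two sensing actions with different precision and effort costs, and greedily picks whichever action yields the most expected information gain per unit cost. The grid world, the "look"/"listen" sensor models, the numeric costs, and the 0.95 commit threshold are all invented assumptions.

```python
# Minimal sketch of resource-rational search under partial observability.
# Everything here (world, sensors, thresholds) is a hypothetical stand-in,
# not the Sensonaut model itself.
import numpy as np

rng = np.random.default_rng(0)

N = 20                          # candidate target locations
target = int(rng.integers(N))   # hidden ground-truth location
belief = np.full(N, 1.0 / N)    # uniform prior over locations

# Assumed sensor models: vision is precise but narrow; audition is coarse but cheap.
ACTIONS = {
    "look":   {"cost": 1.0, "hit": 0.9, "false": 0.05},  # inspect the most likely cell
    "listen": {"cost": 0.3, "hit": 0.6, "false": 0.30},  # coarse "is it in the left half?" cue
}

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def lik_yes(action, loc):
    """P(obs = 'yes' | target at i), for every candidate i, under the assumed sensors."""
    a = ACTIONS[action]
    if action == "look":
        return np.where(np.arange(N) == loc, a["hit"], a["false"])
    return np.where(np.arange(N) < N // 2, a["hit"], a["false"])

def update(belief, p_yes, obs):
    """Bayesian belief update for a binary observation."""
    post = belief * (p_yes if obs else 1.0 - p_yes)
    return post / post.sum()

def gain_per_cost(belief, action, loc):
    """Exact expected entropy reduction of one observation, per unit effort."""
    p_yes = lik_yes(action, loc)
    m = float(belief @ p_yes)  # marginal P(obs = yes)
    h_after = m * entropy(update(belief, p_yes, True)) \
        + (1 - m) * entropy(update(belief, p_yes, False))
    return (entropy(belief) - h_after) / ACTIONS[action]["cost"]

total_cost, steps = 0.0, 0
while belief.max() < 0.95 and steps < 50:   # commit once confident enough
    loc = int(belief.argmax())
    action = max(ACTIONS, key=lambda a: gain_per_cost(belief, a, loc))
    p_yes = lik_yes(action, loc)
    obs = bool(rng.random() < p_yes[target])  # noisy observation from the true world
    belief = update(belief, p_yes, obs)
    total_cost += ACTIONS[action]["cost"]
    steps += 1

print(f"guess={int(belief.argmax())} target={target} steps={steps} cost={total_cost:.1f}")
```

Under this kind of policy, cheap coarse listening tends to dominate early, while the belief is flat, and precise looking takes over once the belief concentrates, which is one way the time/effort trade-off described in the abstract can emerge.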
Related papers
- Human-level 3D shape perception emerges from multi-view learning [63.048728487674815]
We develop a modeling framework that predicts human 3D shape inferences for arbitrary objects. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data. We find that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data.
arXiv Detail & Related papers (2026-02-19T18:56:05Z) - Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization [15.460138469890042]
Humans consistently outperform AI, demonstrating superior resilience to conflicting or missing visuals by relying on auditory information. We propose a neuroscience-inspired model, EchoPin, which uses a stereo audio-image dataset generated via 3D simulations.
arXiv Detail & Related papers (2025-05-16T13:13:25Z) - Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior. Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Computing a human-like reaction time metric from stable recurrent vision models [11.87006916768365]
We sketch a general-purpose methodology to construct computational accounts of reaction times from a stimulus-computable, task-optimized model.
We demonstrate that our metric aligns with patterns of human reaction times for stimulus manipulations across four disparate visual decision-making tasks.
This work paves the way for exploring the temporal alignment of model and human visual strategies in the context of various other cognitive tasks.
arXiv Detail & Related papers (2023-06-20T14:56:02Z) - Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling visually-indicated audio representation.
arXiv Detail & Related papers (2023-03-23T17:43:11Z) - Using Features at Multiple Temporal and Spatial Resolutions to Predict Human Behavior in Real Time [2.955419572714387]
We present an approach for integrating high and low-resolution spatial and temporal information to predict human behavior in real time.
Our model composes neural networks for high and low-resolution feature extraction with a neural network for behavior prediction, with all three networks trained simultaneously.
arXiv Detail & Related papers (2022-11-12T18:41:33Z) - Target-absent Human Attention [44.10971508325032]
We propose the first data-driven computational model that addresses the search-termination problem.
We represent the internal knowledge that the viewer acquires through fixations using a novel state representation.
We improve the state of the art in predicting human target-absent search behavior on the COCO-Search18 dataset.
arXiv Detail & Related papers (2022-07-04T02:32:04Z) - Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning [62.83590925557013]
We learn a set of challenging partially-observed manipulation tasks from visual and audio inputs.
Our proposed system learns these tasks by combining offline imitation learning from tele-operated demonstrations and online finetuning.
In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by 20%.
arXiv Detail & Related papers (2022-05-30T04:52:58Z) - Metric-based multimodal meta-learning for human movement identification via footstep recognition [3.300376360949452]
We describe a novel metric-based learning approach that introduces a multimodal framework.
We learn general-purpose representations from limited multisensory data obtained from omnipresent sensing systems.
Our method employs metric-based contrastive learning over multi-sensor data to mitigate the impact of data scarcity (a minimal sketch of such a loss follows the list below).
arXiv Detail & Related papers (2021-11-15T18:46:14Z) - What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method, MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z) - Noisy Agents: Self-supervised Exploration by Predicting Auditory Events [127.82594819117753]
We propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions.
We train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration.
Experimental results on Atari games show that our new intrinsic motivation significantly outperforms several state-of-the-art baselines.
arXiv Detail & Related papers (2020-07-27T17:59:08Z)
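The "Noisy Agents" entry directly above describes training a network to predict auditory events and paying its prediction errors out as intrinsic rewards for exploration. A minimal, hypothetical sketch of that scheme follows; the linear predictor, the toy environment, and all constants are stand-ins, not the paper's architecture.

```python
# Hypothetical sketch of auditory-event curiosity: a forward model predicts the
# next sound from (state, action), and its surprise is the intrinsic reward.
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, N_ACTIONS, N_EVENTS = 8, 4, 5

# Linear auditory-event predictor: logits = W @ [state; one_hot(action)].
W = rng.normal(scale=0.1, size=(N_EVENTS, STATE_DIM + N_ACTIONS))
LR = 0.05

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def intrinsic_reward_and_update(state, action, event):
    """Cross-entropy surprise of the observed sound; also one SGD step on the predictor."""
    global W
    x = np.concatenate([state, one_hot(action, N_ACTIONS)])
    p = softmax(W @ x)
    surprise = -np.log(p[event] + 1e-8)       # intrinsic reward: how unexpected was the sound?
    grad = (p - one_hot(event, N_EVENTS))[:, None] * x[None, :]
    W -= LR * grad                            # predictor improves, so familiar sounds stop paying
    return float(surprise)

# Toy rollout: a stand-in environment where each action deterministically causes one sound.
for t in range(5):
    s = rng.normal(size=STATE_DIM)
    a = int(rng.integers(N_ACTIONS))
    ev = a % N_EVENTS
    r_int = intrinsic_reward_and_update(s, a, ev)
    print(f"t={t} action={a} event={ev} intrinsic_reward={r_int:.3f}")
```

Because the predictor keeps learning, rewards for repeatable action-sound pairs decay over time, which is what pushes the agent toward actions whose auditory consequences it cannot yet predict.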
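For the footstep-recognition entry earlier in the list, here is the promised minimal sketch of a metric-based contrastive loss over multi-sensor data; the linear encoder, the pairing scheme, and the margin are illustrative assumptions, not the paper's design.

```python
# Hypothetical margin-based contrastive loss for multi-sensor identification:
# pull embeddings of the same walker together, push different walkers apart.
import numpy as np

rng = np.random.default_rng(2)
D_IN, D_EMB, MARGIN = 16, 8, 1.0

W = rng.normal(scale=0.1, size=(D_EMB, D_IN))  # linear "encoder" stand-in

def embed(x):
    z = W @ x
    return z / (np.linalg.norm(z) + 1e-8)      # unit-norm embedding

def contrastive_loss(xa, xb, same):
    """Squared distance for positive pairs; squared hinge on the margin for negatives."""
    d = np.linalg.norm(embed(xa) - embed(xb))
    return float(d**2 if same else max(0.0, MARGIN - d)**2)

# Toy usage: two footstep feature vectors from the same walker vs. a different one.
xa = rng.normal(size=D_IN)
xb = rng.normal(size=D_IN)
print("same-person loss:", round(contrastive_loss(xa, xa + 0.1 * rng.normal(size=D_IN), True), 3))
print("diff-person loss:", round(contrastive_loss(xa, xb, False), 3))
```

Losses of this shape need only pair labels (same walker or not) rather than large per-class datasets, which is one plausible reading of how metric learning mitigates the data scarcity the entry mentions.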