ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments
- URL: http://arxiv.org/abs/2601.13218v1
- Date: Mon, 19 Jan 2026 16:48:45 GMT
- Title: ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments
- Authors: Igor Vozniak, Philipp Mueller, Nils Lipp, Janis Sprenger, Konstantin Poddubnyy, Davit Hovhannisyan, Christian Mueller, Andreas Bulling, Philipp Slusallek
- Abstract summary: We present a novel dataset of spatial street-crossing navigation in virtual reality specifically geared to object-based attention evaluations. The uniqueness of the presented dataset lies in the ethics- and safety-related challenges that make collecting comparable data in real-world environments highly difficult. We propose object-based similarity (oSIM) as a novel metric to evaluate the performance of object-based visual attention models.
- Score: 15.487686125490812
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The object-based nature of human visual attention is well-known in cognitive science, but has so far played only a minor role in computational visual attention models. This is mainly due to a lack of suitable datasets and evaluation metrics for object-based attention. To address these limitations, we present ObjectVisA-120, a novel 120-participant dataset of spatial street-crossing navigation in virtual reality specifically geared to object-based attention evaluations. The uniqueness of the presented dataset lies in the ethics- and safety-related challenges that make collecting comparable data in real-world environments highly difficult. ObjectVisA-120 not only features accurate gaze data and a complete state-space representation of objects in the virtual environment, but also offers variable scenario complexities and rich annotations, including panoptic segmentation, depth information, and vehicle keypoints. We further propose object-based similarity (oSIM) as a novel metric to evaluate the performance of object-based visual attention models, a previously unexplored performance characteristic. Our evaluations show that explicitly optimising for object-based attention not only improves oSIM performance but also leads to improved model performance on common metrics. In addition, we present SUMGraph, a Mamba U-Net-based model that explicitly encodes critical scene objects (vehicles) in a graph representation, leading to further performance improvements over several state-of-the-art visual attention prediction methods. The dataset, code and models will be publicly released.
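The abstract names oSIM but does not give its formula. The sketch below is a hypothetical illustration of one way an object-level similarity could be computed: restricting the standard SIM (histogram-intersection) saliency metric to per-object masks (e.g. taken from the panoptic segmentation) and averaging over objects. The function names and the aggregation rule are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def sim(p, q, eps=1e-12):
    """Standard saliency SIM: histogram intersection of two maps,
    each normalised to sum to one."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return np.minimum(p, q).sum()

def osim(pred, gt, masks, eps=1e-12):
    """Hypothetical object-based similarity sketch: compute SIM
    restricted to each object's mask and average over objects.
    `masks` is a list of boolean arrays, one per scene object
    (e.g. from a panoptic segmentation). Illustrative only."""
    scores = []
    for m in masks:
        if m.sum() == 0:          # skip objects not visible in the frame
            continue
        scores.append(sim(pred[m], gt[m], eps))
    return float(np.mean(scores)) if scores else 0.0

# Toy usage: a 4x4 "frame" with two object masks.
pred = np.random.rand(4, 4)
gt = np.random.rand(4, 4)
masks = [np.zeros((4, 4), bool), np.zeros((4, 4), bool)]
masks[0][:2, :2] = True   # object 1 occupies the top-left block
masks[1][2:, 2:] = True   # object 2 occupies the bottom-right block
print(osim(pred, gt, masks))
```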
Related papers
- DynOPETs: A Versatile Benchmark for Dynamic Object Pose Estimation and Tracking in Moving Camera Scenarios [20.835782699441797]
This paper presents DynOPETs, a novel dataset for object pose estimation and tracking in unconstrained environments. Our efficient annotation method innovatively integrates pose estimation and pose tracking techniques to generate pseudo-labels. The resulting dataset offers accurate pose annotations for dynamic objects observed from moving cameras.
arXiv Detail & Related papers (2025-03-25T13:13:44Z) - Oh-A-DINO: Understanding and Enhancing Attribute-Level Information in Self-Supervised Object-Centric Representations [9.949149600332836]
Self-supervised vision models and slot-based representations excel at identifying edge-derived geometry but fail to preserve non-geometric surface-level cues. We show that learning an auxiliary latent space over segmented patches, where VAE regularisation enforces compact, disentangled object-centric representations, recovers these missing attributes.
arXiv Detail & Related papers (2025-03-12T21:57:41Z) - Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions (a generic sketch of PCA-based localization appears after this list).
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z) - Attribute Descent: Simulating Object-Centric Datasets on the Content Level and Beyond [17.949962340691673]
Between synthetic and real data, a two-level domain gap exists, involving the content level and the appearance level.
We propose an attribute descent approach that automatically optimizes engine attributes so that synthetic data approximates real-world data.
Experiments on image classification and object re-identification confirm that adapted synthetic data can be effectively used in three scenarios.
arXiv Detail & Related papers (2022-02-28T18:58:05Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)
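The PCA-based localization mentioned in the unsupervised object discovery entry above is not detailed in its summary. Below is a minimal, hypothetical sketch of the generic idea, assuming per-patch features from a self-supervised backbone; the sign-orientation heuristic and thresholding are illustrative choices, not that paper's exact procedure.

```python
import numpy as np

def pca_foreground(features, h, w):
    """Illustrative sketch: project per-patch features onto their first
    principal component and threshold it to obtain a coarse foreground
    mask. `features` is an (h*w, d) array of patch descriptors from a
    self-supervised backbone."""
    x = features - features.mean(axis=0, keepdims=True)
    # First right-singular vector of the centred data = first principal direction.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    pc1 = x @ vt[0]                      # per-patch projection onto that direction
    # The sign of a principal component is arbitrary; as a simple heuristic,
    # orient it so the positive side (assumed foreground) covers fewer patches.
    if (pc1 > 0).sum() > pc1.size / 2:
        pc1 = -pc1
    return (pc1 > 0).reshape(h, w)

# Toy usage: a 14x14 grid of 64-dimensional patch features.
feats = np.random.rand(14 * 14, 64).astype(np.float32)
print(pca_foreground(feats, 14, 14).shape)  # (14, 14)
```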
This list is automatically generated from the titles and abstracts of the papers on this site.