Temporal Slowness in Central Vision Drives Semantic Object Learning
- URL: http://arxiv.org/abs/2602.04462v1
- Date: Wed, 04 Feb 2026 11:47:58 GMT
- Title: Temporal Slowness in Central Vision Drives Semantic Object Learning
- Authors: Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch
- Abstract summary: Humans acquire semantic object representations from egocentric visual streams with minimal supervision. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience.
- Score: 10.92192887447196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes with high resolution only the center of its field of view and learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive Self-Supervised Learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience.
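The pipeline the abstract describes, extracting crops centered on predicted gaze locations and training with a time-contrastive (slowness) objective, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the crop size, the boundary clamping, the temperature, and the choice of the immediately following frame as the temporal positive are all assumptions made for the sketch.

```python
import numpy as np

def central_crop(frame, gaze_xy, crop_size):
    """Extract a square crop centered on the predicted gaze point (x, y),
    clamping the window so it stays inside the frame (an assumption; the
    paper does not specify its boundary handling)."""
    h, w = frame.shape[:2]
    half = crop_size // 2
    cx = int(np.clip(gaze_xy[0], half, w - half))
    cy = int(np.clip(gaze_xy[1], half, h - half))
    return frame[cy - half:cy + half, cx - half:cx + half]

def time_contrastive_loss(z, temperature=0.1):
    """InfoNCE-style slowness loss: each frame's positive is its temporal
    neighbor (the next frame); all other frames act as negatives.
    z: (T, D) array of embeddings for consecutive central-vision crops."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize
    sim = z @ z.T / temperature            # (T, T) similarity matrix
    np.fill_diagonal(sim, -np.inf)         # exclude self-similarity
    logits = sim[:-1]                      # anchors are frames 0..T-2
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = np.arange(1, len(z))             # positive = next frame in time
    return -log_prob[np.arange(len(pos)), pos].mean()
```

Embeddings that change slowly over time (as during fixations) yield a low loss, while rapidly changing embeddings are penalized, which is the pressure toward temporally stable, object-centered representations that the paper studies.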
Related papers
- Human-level 3D shape perception emerges from multi-view learning [63.048728487674815]
We develop a modeling framework that predicts human 3D shape inferences for arbitrary objects. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data. We find that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data.
arXiv Detail & Related papers (2026-02-19T18:56:05Z)
- Simulated Cortical Magnification Supports Self-Supervised Object Learning [8.07351541700131]
Recent self-supervised learning models simulate the development of semantic object representations by training on visual experience similar to that of toddlers. Here, we investigate the role of this varying resolution in the development of object representations.
arXiv Detail & Related papers (2025-09-19T08:28:06Z)
- Object Concepts Emerge from Motion [24.73461163778215]
We propose a biologically inspired framework for learning object-centric visual representations in an unsupervised manner. Our key insight is that motion boundaries serve as a strong signal for object-level grouping. Our framework is fully label-free and does not rely on camera calibration, making it scalable to large-scale unstructured video data.
arXiv Detail & Related papers (2025-05-27T18:09:02Z)
- Human Gaze Boosts Object-Centered Representation Learning [7.473473243713322]
Recent self-supervised learning models trained on human-like egocentric visual inputs substantially underperform on image recognition tasks compared to humans. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. Our experiments demonstrate that focusing on central vision leads to better object-centered representations.
arXiv Detail & Related papers (2025-01-06T12:21:40Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior [88.9319150230121]
Object-centric vision aims to construct an explicit representation of the objects in a scene.
We incorporate a spatial-locality prior into state-of-the-art object-centric vision models.
We obtain significant improvements in segmenting objects in both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-31T04:35:50Z)
- Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling visually-indicated audio representation.
arXiv Detail & Related papers (2023-03-23T17:43:11Z)
- Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore how to model peripheral vision in deep neural networks for visual recognition.
We propose to incorporate peripheral position encoding to the multi-head self-attention layers to let the network learn to partition the visual field into diverse peripheral regions given training data.
We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
arXiv Detail & Related papers (2022-06-14T12:47:47Z)
- Embodied vision for learning object representations [4.211128681972148]
We show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments.
We argue that this effect is caused by the reduction of background features, a neural-network bias toward large image features, and a greater similarity between novel and familiar background regions.
arXiv Detail & Related papers (2022-05-12T16:36:27Z)
- Capturing the objects of vision with neural networks [0.0]
Human visual perception carves a scene at its physical joints, decomposing the world into objects.
Deep neural network (DNN) models of visual object recognition, by contrast, remain largely tethered to the sensory input.
We review related work in both fields and examine how these fields can help each other.
arXiv Detail & Related papers (2021-09-07T21:49:53Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions.
We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors.
Experiments show that our model can infer the physical properties within a few observations, which allows the model to quickly adapt to unseen scenarios and make accurate predictions into the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.