Embodied vision for learning object representations
- URL: http://arxiv.org/abs/2205.06198v1
- Date: Thu, 12 May 2022 16:36:27 GMT
- Title: Embodied vision for learning object representations
- Authors: Arthur Aubret, Céline Teulière and Jochen Triesch
- Abstract summary: We show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments.
We argue that this effect is caused by the reduction of features extracted in the background, a neural network bias for large features in the image, and a greater similarity between novel and familiar background regions.
- Score: 4.211128681972148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent time-contrastive learning approaches manage to learn invariant object
representations without supervision. This is achieved by mapping successive
views of an object onto close-by internal representations. When considering
this learning approach as a model of the development of human object
recognition, it is important to consider what visual input a toddler would
typically observe while interacting with objects. First, human vision is highly
foveated, with high resolution only available in the central region of the
field of view. Second, objects may be seen against a blurry background due to
infants' limited depth of field. Third, during object manipulation a toddler
mostly observes close objects filling a large part of the field of view due to
their rather short arms. Here, we study how these effects impact the quality of
visual representations learnt through time-contrastive learning. To this end,
we let a visually embodied agent "play" with objects in different locations of
a near photo-realistic flat. During each play session the agent views an object
in multiple orientations before turning its body to view another object. The
resulting sequence of views feeds a time-contrastive learning algorithm. Our
results show that visual statistics mimicking those of a toddler improve object
recognition accuracy in both familiar and novel environments. We argue that
this effect is caused by the reduction of features extracted in the background,
a neural network bias for large features in the image and a greater similarity
between novel and familiar background regions. We conclude that the embodied
nature of visual learning may be crucial for understanding the development of
human object perception.
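The core mechanism described above, mapping successive views of an object onto close-by internal representations, can be sketched as a temporal InfoNCE objective in which temporally adjacent frames form positive pairs. The following is a minimal sketch under stated assumptions: the PyTorch encoder interface, batch construction, and SimCLR-style loss are illustrative choices, not the authors' exact implementation.

```python
# Minimal sketch of a time-contrastive (temporal InfoNCE) objective.
# Assumptions: a PyTorch image encoder and batches of frame pairs taken at
# consecutive time steps of a play session; the SimCLR-style loss is an
# illustrative choice, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def time_contrastive_loss(encoder: torch.nn.Module,
                          frames_t: torch.Tensor,
                          frames_t1: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """Pull embeddings of temporally adjacent views together.

    frames_t[i] and frames_t1[i] are successive views of the same object
    (a positive pair); all other images in the batch act as negatives.
    """
    z_t = F.normalize(encoder(frames_t), dim=1)    # (B, D) embeddings at time t
    z_t1 = F.normalize(encoder(frames_t1), dim=1)  # (B, D) embeddings at time t+1

    # Cosine similarities between every view at t and every view at t+1.
    logits = z_t @ z_t1.T / temperature            # (B, B)

    # For row i, the temporally adjacent view i is the positive class.
    targets = torch.arange(z_t.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

Minimizing such a loss over the view sequences produced during the agent's play sessions pushes successive views of the same object onto nearby representations, which is the invariance property the paper evaluates under toddler-like visual statistics.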
Related papers
- Active Gaze Behavior Boosts Self-Supervised Object Learning [4.612042044544857]
We study whether a bio-inspired visual learning model can harness toddlers' gaze behavior during a play session to develop view-invariant object recognition.
Our experiments demonstrate that toddlers' gaze strategy supports the learning of invariant object representations.
Overall, our work reveals how toddlers' gaze behavior supports self-supervised learning of view-invariant object recognition.
arXiv Detail & Related papers (2024-11-04T10:44:46Z)
- Learning 3D object-centric representation through prediction [12.008668555280668]
We develop a novel network architecture that learns to 1) segment objects from discrete images, 2) infer their 3D locations, and 3) perceive depth.
The core idea is treating objects as latent causes of visual input which the brain uses to make efficient predictions of future scenes.
arXiv Detail & Related papers (2024-03-06T14:19:11Z)
- Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior [88.9319150230121]
Object-centric vision aims to construct an explicit representation of the objects in a scene.
We incorporate a spatial-locality prior into state-of-the-art object-centric vision models.
We obtain significant improvements in segmenting objects in both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-31T04:35:50Z)
- A Computational Account Of Self-Supervised Visual Learning From Egocentric Object Play [3.486683381782259]
We study how learning signals that equate different viewpoints can support robust visual learning.
We find that representations learned by equating different physical viewpoints of an object benefit downstream image classification accuracy.
arXiv Detail & Related papers (2023-05-30T22:42:03Z)
- Understanding Self-Supervised Pretraining with Part-Aware Representation Learning [88.45460880824376]
We study the capability that self-supervised representation pretraining methods learn part-aware representations.
Results show that the fully-supervised model outperforms self-supervised models for object-level recognition.
arXiv Detail & Related papers (2023-01-27T18:58:42Z)
- Bi-directional Object-context Prioritization Learning for Saliency Ranking [60.62461793691836]
Existing approaches focus on learning either object-object or object-scene relations.
We observe that spatial attention works concurrently with object-based attention in the human visual recognition system.
We propose a novel bi-directional method to unify spatial attention and object-based attention for saliency ranking.
arXiv Detail & Related papers (2022-03-17T16:16:03Z)
- Capturing the objects of vision with neural networks [0.0]
Human visual perception carves a scene at its physical joints, decomposing the world into objects.
Deep neural network (DNN) models of visual object recognition, by contrast, remain largely tethered to the sensory input.
We review related work in both fields and examine how these fields can help each other.
arXiv Detail & Related papers (2021-09-07T21:49:53Z)
- A Simple and Effective Use of Object-Centric Images for Long-Tailed Object Detection [56.82077636126353]
We take advantage of object-centric images to improve object detection in scene-centric images.
We present a simple yet surprisingly effective framework to do so.
Our approach can improve the object detection (and instance segmentation) accuracy of rare objects by 50% (and 33%) in relative terms.
arXiv Detail & Related papers (2021-02-17T17:27:21Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- VisualEchoes: Spatial Image Representation Learning through Echolocation [97.23789910400387]
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation.
We propose a novel interaction-based representation learning framework that learns useful visual features via echolocation.
Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
arXiv Detail & Related papers (2020-05-04T16:16:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.