Learning and generalization of compositional representations of visual
scenes
- URL: http://arxiv.org/abs/2303.13691v1
- Date: Thu, 23 Mar 2023 22:03:42 GMT
- Title: Learning and generalization of compositional representations of visual
scenes
- Authors: E. Paxon Frady, Spencer Kent, Quinn Tran, Pentti Kanerva, Bruno A.
Olshausen, Friedrich T. Sommer
- Abstract summary: We use distributed representations of object attributes and vector operations in a vector symbolic architecture to create a full compositional description of a scene.
To control the scene composition, we use artificial images composed of multiple, translated and colored MNIST digits.
The output of the deep network can then be interpreted by a VSA resonator network to extract object identity or other properties of individual objects.
- Score: 2.960473840509733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Complex visual scenes that are composed of multiple objects, each with
attributes, such as object name, location, pose, color, etc., are challenging
to describe in order to train neural networks. Usually, deep learning networks
are trained with supervision from categorical scene descriptions. The common
categorical description of a scene contains the names of individual objects but
lacks information about other attributes. Here, we use distributed
representations of object attributes and vector operations in a vector symbolic
architecture to create a full compositional description of a scene in a
high-dimensional vector. To control the scene composition, we use artificial
images composed of multiple, translated and colored MNIST digits. In contrast
to learning category labels, here we train deep neural networks to output the
full compositional vector description of an input image. The output of the deep
network can then be interpreted by a VSA resonator network to extract object
identity or other properties of individual objects. We evaluate the performance
and generalization properties of the system on randomly generated scenes.
Specifically, we show that the network is able to learn the task and generalize
to unseen digit shapes and scene configurations. However, the
generalization ability of the trained model is limited. For example, with a gap
in the training data, such as an object never shown at a particular image
location during training, learning does not automatically fill this gap.
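To make the compositional vector description concrete, the following is a minimal sketch of the general mechanism: each attribute value (digit identity, color, x/y position) gets a random high-dimensional codeword, an object is the binding of its attribute codewords (here a Hadamard product of bipolar vectors), a scene is the bundle (sum) of its object vectors, and a resonator network factorizes such a vector back into attribute codewords. The dimensionality, codebook sizes, and the choice of bipolar Hadamard binding are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # vector dimensionality (illustrative)

# Random bipolar codebooks, one codeword per attribute value (sizes are assumptions).
digits = rng.choice([-1, 1], size=(10, D))   # digit identity 0-9
colors = rng.choice([-1, 1], size=(7, D))    # color index
x_pos  = rng.choice([-1, 1], size=(26, D))   # horizontal offset
y_pos  = rng.choice([-1, 1], size=(26, D))   # vertical offset

def bipolar(v):
    """Sign function with ties broken to +1, keeping vectors strictly bipolar."""
    return np.where(v >= 0, 1, -1)

def encode_object(d, c, x, y):
    """Bind the attributes of one object with the Hadamard product."""
    return digits[d] * colors[c] * x_pos[x] * y_pos[y]

def encode_scene(objects):
    """Bundle (sum) the bound vectors of all objects into one scene vector."""
    return np.sum([encode_object(*o) for o in objects], axis=0)

def resonator_decode(s, codebooks, n_iter=50):
    """Iteratively factorize s into one codeword per codebook (resonator dynamics)."""
    # Initialize each factor estimate with the superposition of its codebook.
    est = [bipolar(cb.sum(axis=0)) for cb in codebooks]
    for _ in range(n_iter):
        for i, cb in enumerate(codebooks):
            # Unbind the current estimates of all other factors ...
            others = np.prod([e for j, e in enumerate(est) if j != i], axis=0)
            # ... and project the result back onto this factor's codebook.
            est[i] = bipolar(cb.T @ (cb @ (s * others)))
    # Report the index of the best-matching codeword for each factor.
    return [int(np.argmax(cb @ e)) for cb, e in zip(codebooks, est)]

# Decode a single bound object: should recover digit 3, color 2, position (5, 7).
obj = encode_object(3, 2, 5, 7)
print(resonator_decode(obj, [digits, colors, x_pos, y_pos]))  # -> [3, 2, 5, 7]

# A scene vector bundles several objects; decoding then proceeds object by object.
scene = encode_scene([(3, 2, 5, 7), (8, 0, 12, 4)])
```

A bundled scene vector is read out in a similar way, typically one object at a time (for example by explaining away an already-decoded object before decoding the next). The abstract does not spell out the exact read-out procedure, so this sketch illustrates the general technique rather than the paper's implementation.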
Related papers
- How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models [2.048226951354646]
An extreme form of such abstract representations is symbols.
This work estimates the state of such structured representations in vision encoders.
We show the network dynamics that cause failure modes of these models on basic downstream tasks in a multi-object scene.
arXiv Detail & Related papers (2024-06-13T12:54:20Z)
- Semiotics Networks Representing Perceptual Inference [0.0]
We present a computational model designed to track and simulate the perception of objects.
Our model is not limited to persons and can be applied to any system featuring a loop involving processing from "internal" to "external" representations.
arXiv Detail & Related papers (2023-10-08T16:05:17Z)
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Compositional Sketch Search [91.84489055347585]
We present an algorithm for searching image collections using free-hand sketches.
We exploit drawings as a concise and intuitive representation for specifying entire scene compositions.
arXiv Detail & Related papers (2021-06-15T09:38:09Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
- Understanding the Role of Individual Units in a Deep Neural Network [85.23117441162772]
We present an analytic framework to systematically identify hidden units within image classification and image generation networks.
First, we analyze a convolutional neural network (CNN) trained on scene classification and discover units that match a diverse set of object concepts.
Second, we use a similar analytic method to analyze a generative adversarial network (GAN) model trained to generate scenes.
arXiv Detail & Related papers (2020-09-10T17:59:10Z)
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
- Learning Physical Graph Representations from Visual Scenes [56.7938395379406]
Physical Scene Graphs (PSGs) represent scenes as hierarchical graphs with nodes corresponding intuitively to object parts at different scales, and edges to physical connections between parts.
PSGNet augments standard CNNs by including: recurrent feedback connections to combine low and high-level image information; graph pooling and vectorization operations that convert spatially-uniform feature maps into object-centric graph structures.
We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks.
arXiv Detail & Related papers (2020-06-22T16:10:26Z)