Learning and generalization of compositional representations of visual
scenes
- URL: http://arxiv.org/abs/2303.13691v1
- Date: Thu, 23 Mar 2023 22:03:42 GMT
- Title: Learning and generalization of compositional representations of visual
scenes
- Authors: E. Paxon Frady, Spencer Kent, Quinn Tran, Pentti Kanerva, Bruno A.
Olshausen, Friedrich T. Sommer
- Abstract summary: We use distributed representations of object attributes and vector operations in a vector symbolic architecture to create a full compositional description of a scene.
To control the scene composition, we use artificial images composed of multiple, translated and colored MNIST digits.
The output of the deep network can then be interpreted by a VSA resonator network to extract object identity or other properties of individual objects.
- Score: 2.960473840509733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Complex visual scenes that are composed of multiple objects, each with
attributes, such as object name, location, pose, color, etc., are challenging
to describe in order to train neural networks. Usually, deep learning networks
are trained with supervision from categorical scene descriptions. The common
categorical description of a scene contains the names of individual objects but
lacks information about other attributes. Here, we use distributed
representations of object attributes and vector operations in a vector symbolic
architecture to create a full compositional description of a scene in a
high-dimensional vector. To control the scene composition, we use artificial
images composed of multiple, translated and colored MNIST digits. In contrast
to learning category labels, here we train deep neural networks to output the
full compositional vector description of an input image. The output of the deep
network can then be interpreted by a VSA resonator network to extract object
identity or other properties of individual objects. We evaluate the performance
and generalization properties of the system on randomly generated scenes.
Specifically, we show that the network is able to learn the task and generalize
to unseen digit shapes and scene configurations. However, the
generalization ability of the trained model is limited. For example, with a gap
in the training data, such as an object never shown at a particular image
location during training, learning does not automatically fill this gap.
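To make the compositional vector description concrete, the following is a minimal sketch of the general mechanism: each attribute value (digit identity, color, x/y position) gets a random high-dimensional codeword, an object is the binding of its attribute codewords (here a Hadamard product of bipolar vectors), a scene is the bundle (sum) of its object vectors, and a resonator network factorizes such a vector back into attribute codewords. The dimensionality, codebook sizes, and the choice of bipolar Hadamard binding are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # vector dimensionality (illustrative)

# Random bipolar codebooks, one codeword per attribute value (sizes are assumptions).
digits = rng.choice([-1, 1], size=(10, D))   # digit identity 0-9
colors = rng.choice([-1, 1], size=(7, D))    # color index
x_pos  = rng.choice([-1, 1], size=(26, D))   # horizontal offset
y_pos  = rng.choice([-1, 1], size=(26, D))   # vertical offset

def bipolar(v):
    """Sign function with ties broken to +1, keeping vectors strictly bipolar."""
    return np.where(v >= 0, 1, -1)

def encode_object(d, c, x, y):
    """Bind the attributes of one object with the Hadamard product."""
    return digits[d] * colors[c] * x_pos[x] * y_pos[y]

def encode_scene(objects):
    """Bundle (sum) the bound vectors of all objects into one scene vector."""
    return np.sum([encode_object(*o) for o in objects], axis=0)

def resonator_decode(s, codebooks, n_iter=50):
    """Iteratively factorize s into one codeword per codebook (resonator dynamics)."""
    # Initialize each factor estimate with the superposition of its codebook.
    est = [bipolar(cb.sum(axis=0)) for cb in codebooks]
    for _ in range(n_iter):
        for i, cb in enumerate(codebooks):
            # Unbind the current estimates of all other factors ...
            others = np.prod([e for j, e in enumerate(est) if j != i], axis=0)
            # ... and project the result back onto this factor's codebook.
            est[i] = bipolar(cb.T @ (cb @ (s * others)))
    # Report the index of the best-matching codeword for each factor.
    return [int(np.argmax(cb @ e)) for cb, e in zip(codebooks, est)]

# Decode a single bound object: should recover digit 3, color 2, position (5, 7).
obj = encode_object(3, 2, 5, 7)
print(resonator_decode(obj, [digits, colors, x_pos, y_pos]))  # -> [3, 2, 5, 7]

# A scene vector bundles several objects; decoding then proceeds object by object.
scene = encode_scene([(3, 2, 5, 7), (8, 0, 12, 4)])
```

A bundled scene vector is read out in a similar way, typically one object at a time (for example by explaining away an already-decoded object before decoding the next). The abstract does not spell out the exact read-out procedure, so this sketch illustrates the general technique rather than the paper's implementation.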
Related papers
- How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models [2.048226951354646]
An extreme form of such abstract representations is symbols.
This work estimates the state of such structured representations in vision encoders.
We show the network dynamics that cause failure modes of these models on basic downstream tasks in a multi-object scene.
arXiv Detail & Related papers (2024-06-13T12:54:20Z)
- Semiotics Networks Representing Perceptual Inference [0.0]
We present a computational model designed to track and simulate the perception of objects.
Our model is not limited to persons and can be applied to any system featuring a loop involving processing from "internal" to "external" representations.
arXiv Detail & Related papers (2023-10-08T16:05:17Z)
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Compositional Sketch Search [91.84489055347585]
We present an algorithm for searching image collections using free-hand sketches.
We exploit drawings as a concise and intuitive representation for specifying entire scene compositions.
arXiv Detail & Related papers (2021-06-15T09:38:09Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
- Understanding the Role of Individual Units in a Deep Neural Network [85.23117441162772]
We present an analytic framework to systematically identify hidden units within image classification and image generation networks.
First, we analyze a convolutional neural network (CNN) trained on scene classification and discover units that match a diverse set of object concepts.
Second, we use a similar analytic method to analyze a generative adversarial network (GAN) model trained to generate scenes.
arXiv Detail & Related papers (2020-09-10T17:59:10Z)
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
- Learning Physical Graph Representations from Visual Scenes [56.7938395379406]
Physical Scene Graphs (PSGs) represent scenes as hierarchical graphs with nodes corresponding intuitively to object parts at different scales, and edges to physical connections between parts.
PSGNet augments standard CNNs by including: recurrent feedback connections to combine low and high-level image information; graph pooling and vectorization operations that convert spatially-uniform feature maps into object-centric graph structures.
We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks.
arXiv Detail & Related papers (2020-06-22T16:10:26Z)