Learning Physical Graph Representations from Visual Scenes
- URL: http://arxiv.org/abs/2006.12373v2
- Date: Wed, 24 Jun 2020 17:33:35 GMT
- Title: Learning Physical Graph Representations from Visual Scenes
- Authors: Daniel M. Bear, Chaofei Fan, Damian Mrowca, Yunzhu Li, Seth Alter,
Aran Nayebi, Jeremy Schwartz, Li Fei-Fei, Jiajun Wu, Joshua B. Tenenbaum,
Daniel L.K. Yamins
- Abstract summary: Physical Scene Graphs (PSGs) represent scenes as hierarchical graphs with nodes corresponding intuitively to object parts at different scales, and edges to physical connections between parts.
PSGNet augments standard CNNs by including: recurrent feedback connections to combine low and high-level image information; graph pooling and vectorization operations that convert spatially-uniform feature maps into object-centric graph structures.
We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks.
- Score: 56.7938395379406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional Neural Networks (CNNs) have proved exceptional at learning
representations for visual object categorization. However, CNNs do not
explicitly encode objects, parts, and their physical properties, which has
limited CNNs' success on tasks that require structured understanding of visual
scenes. To overcome these limitations, we introduce the idea of Physical Scene
Graphs (PSGs), which represent scenes as hierarchical graphs, with nodes in the
hierarchy corresponding intuitively to object parts at different scales, and
edges to physical connections between parts. Bound to each node is a vector of
latent attributes that intuitively represent object properties such as surface
shape and texture. We also describe PSGNet, a network architecture that learns
to extract PSGs by reconstructing scenes through a PSG-structured bottleneck.
PSGNet augments standard CNNs by including: recurrent feedback connections to
combine low and high-level image information; graph pooling and vectorization
operations that convert spatially-uniform feature maps into object-centric
graph structures; and perceptual grouping principles to encourage the
identification of meaningful scene elements. We show that PSGNet outperforms
alternative self-supervised scene representation algorithms at scene
segmentation tasks, especially on complex real-world images, and generalizes
well to unseen object types and scene arrangements. PSGNet is also able to learn
from physical motion, enhancing scene estimates even for static images. We
present a series of ablation studies illustrating the importance of each
component of the PSGNet architecture, analyses showing that learned latent
attributes capture intuitive scene properties, and illustrations of the use of PSGs
for compositional scene inference.
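As a concrete illustration of the graph pooling and vectorization idea in the abstract, the following minimal NumPy sketch aggregates pixel features over a given segment map into per-node attribute vectors and links spatially adjacent segments. It is a simplification under assumed shapes, not the paper's method: in PSGNet the grouping is produced by learned perceptual-grouping principles rather than a precomputed segment map.

import numpy as np

def pool_to_nodes(features, segments):
    """Vectorize a spatial feature map into per-segment node attributes.

    features: (H, W, C) array of CNN features.
    segments: (H, W) integer array assigning each pixel to a group.
    Returns (num_segments, C + 2): mean features plus a normalized centroid.
    """
    H, W, _ = features.shape
    ys, xs = np.mgrid[0:H, 0:W]
    nodes = []
    for i in np.unique(segments):
        mask = segments == i
        attrs = features[mask].mean(axis=0)            # pooled node attributes
        centroid = [ys[mask].mean() / H, xs[mask].mean() / W]
        nodes.append(np.concatenate([attrs, centroid]))
    return np.stack(nodes)

def adjacency_edges(segments):
    """Link segments that share a pixel border (a crude stand-in for
    the paper's learned within- and across-group affinities)."""
    right = np.stack([segments[:, :-1].ravel(), segments[:, 1:].ravel()], axis=1)
    down = np.stack([segments[:-1, :].ravel(), segments[1:, :].ravel()], axis=1)
    pairs = np.concatenate([right, down])
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]          # drop within-segment pairs
    return np.unique(np.sort(pairs, axis=1), axis=0)   # undirected, deduplicated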
Related papers
- Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis [11.353826466710398]
We propose a novel dynamic graph representation algorithm that conceptualizes whole slide images (WSIs) as a form of knowledge graph.
Specifically, we dynamically construct neighbors and directed edge embeddings based on the head and tail relationships between instances.
Our end-to-end graph representation learning approach outperforms state-of-the-art WSI analysis methods on three TCGA benchmark datasets and on in-house test sets.
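A minimal sketch of the dynamic neighbor construction described above, assuming instance features have already been extracted; representing each directed head-to-tail edge by a feature difference is an assumption here, not the paper's knowledge-aware attention mechanism.

import torch

def knn_directed_edges(x, k=8):
    """Dynamically build a directed kNN graph over instance embeddings.

    x: (N, D) float tensor of patch/instance features.
    Returns a (2, N*k) edge index and (N*k, D) directed edge embeddings.
    """
    dist = torch.cdist(x, x)                      # (N, N) pairwise distances
    dist.fill_diagonal_(float('inf'))             # exclude self-loops
    nbrs = dist.topk(k, largest=False).indices    # (N, k) nearest neighbors
    heads = torch.arange(x.size(0)).repeat_interleave(k)
    tails = nbrs.reshape(-1)
    edge_attr = x[tails] - x[heads]               # directed head -> tail embedding
    return torch.stack([heads, tails]), edge_attr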
arXiv Detail & Related papers (2024-03-12T14:58:51Z)
- Two Stream Scene Understanding on Graph Embedding [4.78180589767256]
The paper presents a novel two-stream network architecture for enhancing scene understanding in computer vision.
The graph feature stream comprises a segmentation structure, a scene graph generation module, and a graph representation module.
Experiments conducted on the ADE20K dataset demonstrate the effectiveness of the proposed two-stream network in improving image classification accuracy.
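A hypothetical sketch of such a two-stream design in PyTorch; the stream sizes, module names, and concatenation-based fusion are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class TwoStreamClassifier(nn.Module):
    """Fuse an image stream with a graph-embedding stream for scene
    classification (all dimensions here are illustrative assumptions)."""
    def __init__(self, graph_dim=128, num_classes=150):
        super().__init__()
        self.image_stream = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # -> (B, 32)
        self.head = nn.Linear(32 + graph_dim, num_classes)

    def forward(self, image, graph_embedding):
        # graph_embedding: (B, graph_dim) summary of the scene graph stream
        fused = torch.cat([self.image_stream(image), graph_embedding], dim=1)
        return self.head(fused)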
arXiv Detail & Related papers (2023-11-12T05:57:56Z)
- Learning and generalization of compositional representations of visual scenes [2.960473840509733]
We use distributed representations of object attributes and vector operations in a vector symbolic architecture (VSA) to create a full compositional description of a scene.
To control the scene composition, we use artificial images composed of multiple, translated and colored MNIST digits.
The output of the deep network can then be interpreted by a VSA resonator network to extract object identity or other properties of individual objects.
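A minimal sketch of the VSA binding and bundling primitives this relies on, using random bipolar hypervectors; the role names and dimensionality are assumptions, and the paper's resonator-network decoding is replaced here by brute-force codebook matching.

import numpy as np

rng = np.random.default_rng(0)
D = 4096                                   # hypervector dimensionality (assumed)

def rand_vec():                            # random bipolar codebook vector
    return rng.choice([-1.0, 1.0], size=D)

digits = {d: rand_vec() for d in "0123456789"}
colors = {c: rand_vec() for c in ("red", "green", "blue")}
ID, COLOR, POS1, POS2 = (rand_vec() for _ in range(4))   # role vectors

# Bind roles to fillers (elementwise product), bundle objects into a scene (sum)
scene = (POS1 * (ID * digits["3"] + COLOR * colors["red"])
         + POS2 * (ID * digits["7"] + COLOR * colors["blue"]))

# Unbind: which digit sits at position 1? Self-inverse bipolar binding means
# scene * POS1 * ID is approximately digits["3"] plus crosstalk noise.
query = scene * POS1 * ID
best = max(digits, key=lambda d: float(digits[d] @ query))
print(best)                                # expected: "3"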
arXiv Detail & Related papers (2023-03-23T22:03:42Z)
- Task-specific Scene Structure Representations [13.775485887433815]
We propose a single general neural network architecture for extracting task-specific structure guidance for scenes.
Our main contribution is to show that such a simple network can achieve state-of-the-art results for several low-level vision applications.
arXiv Detail & Related papers (2023-01-02T08:25:47Z)
- A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective [71.03621840455754]
Graph Neural Networks (GNNs) have gained momentum in graph representation learning.
Graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation.
This paper presents a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective.
arXiv Detail & Related papers (2022-09-27T08:10:14Z)
- Relation Regularized Scene Graph Generation [206.76762860019065]
Scene graph generation (SGG) is built on top of detected objects to predict object pairwise visual relations.
We propose a relation regularized network (R2-Net) which can predict whether there is a relationship between two objects.
Our R2-Net can effectively refine object labels and generate scene graphs.
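A minimal stand-in for the relation-existence idea: score every ordered pair of detected objects with a small MLP. The dimensions and architecture are assumptions, not R2-Net's actual design.

import torch
import torch.nn as nn

class RelationExistence(nn.Module):
    """Predict whether any relationship holds between two objects."""
    def __init__(self, obj_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * obj_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obj_feats):
        """obj_feats: (N, obj_dim) -> (N, N) logits; entry (i, j) scores
        whether object i relates to object j."""
        n = obj_feats.size(0)
        pairs = torch.cat([obj_feats.unsqueeze(1).expand(n, n, -1),   # subject i
                           obj_feats.unsqueeze(0).expand(n, n, -1)],  # object j
                          dim=-1)
        return self.mlp(pairs).squeeze(-1)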
arXiv Detail & Related papers (2022-02-22T11:36:49Z)
- Learning Spatial Context with Graph Neural Network for Multi-Person Pose Grouping [71.59494156155309]
Bottom-up approaches for image-based multi-person pose estimation consist of two stages: keypoint detection and grouping.
In this work, we formulate the grouping task as a graph partitioning problem, where we learn the affinity matrix with a Graph Neural Network (GNN).
The learned geometry-based affinity is further fused with appearance-based affinity to achieve robust keypoint association.
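A toy sketch of fusing the two affinities and partitioning the keypoint graph; the mixing weight, threshold, and greedy union-find partition are assumptions standing in for the paper's learned formulation.

import torch

def fuse_affinities(a_geo, a_app, alpha=0.5):
    """Fuse GNN-learned geometry affinity with appearance affinity.
    Both inputs are (K, K) matrices in [0, 1]; alpha is assumed."""
    return alpha * a_geo + (1 - alpha) * a_app

def greedy_group(affinity, thresh=0.5):
    """Toy partition: union keypoints whose fused affinity clears the
    threshold, returning one root label per keypoint."""
    k = affinity.size(0)
    parent = list(range(k))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for i in range(k):
        for j in range(i + 1, k):
            if affinity[i, j] > thresh:
                parent[find(i)] = find(j)
    return [find(i) for i in range(k)]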
arXiv Detail & Related papers (2021-04-06T09:21:14Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- Understanding the Role of Individual Units in a Deep Neural Network [85.23117441162772]
We present an analytic framework to systematically identify hidden units within image classification and image generation networks.
First, we analyze a convolutional neural network (CNN) trained on scene classification and discover units that match a diverse set of object concepts.
Second, we use a similar analytic method to analyze a generative adversarial network (GAN) model trained to generate scenes.
arXiv Detail & Related papers (2020-09-10T17:59:10Z)
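A minimal sketch in the spirit of that unit analysis: threshold a single unit's upsampled activation map at a high quantile and score its overlap with a concept segmentation mask by IoU (the quantile value is an assumption).

import numpy as np

def unit_concept_iou(activation, concept_mask, quantile=0.995):
    """Score how well one CNN unit matches a visual concept.

    activation: (H, W) upsampled activation map of a single unit.
    concept_mask: (H, W) boolean mask for the concept (e.g., 'lamp').
    Returns intersection-over-union of the thresholded unit and the mask.
    """
    thresh = np.quantile(activation, quantile)
    unit_mask = activation >= thresh
    inter = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return inter / max(union, 1)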
This list is automatically generated from the titles and abstracts of the papers in this site.