Disentangling Visual Priors: Unsupervised Learning of Scene Interpretations with Compositional Autoencoder
- URL: http://arxiv.org/abs/2409.09716v1
- Date: Sun, 15 Sep 2024 12:47:39 GMT
- Title: Disentangling Visual Priors: Unsupervised Learning of Scene Interpretations with Compositional Autoencoder
- Authors: Krzysztof Krawiec, Antoni Nowinowski
- Abstract summary: We propose a neurosymbolic architecture that uses a domain-specific language to capture selected priors of image formation.
We express template programs in that language and learn their parameterization with features extracted from the scene by a convolutional neural network.
When executed, the parameterized program produces geometric primitives which are rendered and assessed for correspondence with the scene content.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary deep learning architectures lack principled means for capturing and handling fundamental visual concepts, like objects, shapes, geometric transforms, and other higher-level structures. We propose a neurosymbolic architecture that uses a domain-specific language to capture selected priors of image formation, including object shape, appearance, categorization, and geometric transforms. We express template programs in that language and learn their parameterization with features extracted from the scene by a convolutional neural network. When executed, the parameterized program produces geometric primitives which are rendered and assessed for correspondence with the scene content, and the model is trained via auto-association with gradient descent. We confront our approach with a baseline method on a synthetic benchmark and demonstrate its capacity to disentangle selected aspects of the image formation process, learn from small data, correct inference in the presence of noise, and generalize out of sample.
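To make the described pipeline concrete, the following is a minimal sketch of the auto-associative loop, assuming PyTorch: a convolutional encoder predicts per-primitive parameters, a toy differentiable renderer of Gaussian blobs composites them into an image, and the reconstruction error is back-propagated. The class name, the blob renderer, and all layer sizes are illustrative assumptions; the paper's actual method parameterizes template programs expressed in its domain-specific language rather than raw blob parameters.

```python
# Minimal sketch of a compositional autoencoder (not the authors' code):
# CNN features -> primitive parameters -> differentiable rendering -> MSE loss.
import torch
import torch.nn as nn

class CompositionalAutoencoder(nn.Module):
    def __init__(self, img_size: int = 64, n_primitives: int = 4):
        super().__init__()
        self.img_size = img_size
        self.n_primitives = n_primitives
        # Convolutional feature extractor (stand-in for the paper's CNN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Parameter head: per primitive (x, y, scale, r, g, b) in [0, 1].
        self.params = nn.Linear(64, n_primitives * 6)

    def render(self, p: torch.Tensor) -> torch.Tensor:
        # p: (B, K, 6); composite K soft Gaussian blobs into an RGB image.
        B = p.shape[0]
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, self.img_size),
            torch.linspace(0, 1, self.img_size),
            indexing="ij",
        )
        grid = torch.stack([xs, ys]).to(p.device)                 # (2, H, W)
        centers = p[..., :2].view(B, self.n_primitives, 2, 1, 1)
        scales = p[..., 2].view(B, self.n_primitives, 1, 1) * 0.2 + 0.02
        colors = p[..., 3:].view(B, self.n_primitives, 3, 1, 1)
        d2 = ((grid - centers) ** 2).sum(dim=2)                   # (B, K, H, W)
        alpha = torch.exp(-d2 / (2 * scales ** 2)).unsqueeze(2)   # (B, K, 1, H, W)
        # Simple additive compositing, clamped to the valid pixel range.
        return (alpha * colors).sum(dim=1).clamp(0, 1)            # (B, 3, H, W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(x)
        p = torch.sigmoid(self.params(feats)).view(-1, self.n_primitives, 6)
        return self.render(p)

if __name__ == "__main__":
    model = CompositionalAutoencoder()
    x = torch.rand(8, 3, 64, 64)                 # toy batch of scenes
    recon = model(x)
    loss = nn.functional.mse_loss(recon, x)      # auto-associative objective
    loss.backward()                              # trainable end to end
    print(recon.shape, float(loss))
```

The renderer here is deliberately crude; the point is only that primitive parameters flow through a differentiable image-formation step, so reconstruction error can train the encoder without labels.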
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z) - Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition [43.61569815081384]
We propose Symmetric Superimposition Modeling to simultaneously capture local character features and linguistic information in text images.
At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context.
At the feature level, we reconstruct the feature of the same original image and inverted image with different augmentations to model the semantic-level linguistic context and the local character discrimination.
arXiv Detail & Related papers (2024-05-09T15:23:38Z) - Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z) - Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z) - Recursive Neural Programs: Variational Learning of Image Grammars and Part-Whole Hierarchies [1.5990720051907859]
We introduce Recursive Neural Programs (RNPs), the first neural generative model to address the part-whole hierarchy learning problem.
Our results show that RNPs provide an intuitive and explainable way of composing objects and scenes.
arXiv Detail & Related papers (2022-06-16T22:02:06Z) - Zero-Shot Text-Guided Object Generation with Dream Fields [111.06026544180398]
We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects.
Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision.
In experiments, Dream Fields produce realistic, multi-view consistent object geometry and color from a variety of natural language captions.
arXiv Detail & Related papers (2021-12-02T17:53:55Z) - Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z) - Learned Spatial Representations for Few-shot Talking-Head Synthesis [68.3787368024951]
We propose a novel approach for few-shot talking-head synthesis.
We show that the resulting disentangled representation leads to a significant improvement over previous methods.
arXiv Detail & Related papers (2021-04-29T17:59:42Z) - Synthesizing human-like sketches from natural images using a conditional convolutional decoder [3.3504365823045035]
We propose a fully convolutional end-to-end architecture that is able to synthesize human-like sketches of objects in natural images.
We train our structure in an end-to-end supervised fashion on a collection of sketch-image pairs.
Sketches generated by our architecture can be classified with 85.6% Top-5 accuracy, and we verify their visual quality via a user study.
arXiv Detail & Related papers (2020-03-16T10:42:53Z) - Shallow2Deep: Indoor Scene Modeling by Single Image Understanding [42.87957414916607]
We present an automatic indoor scene modeling approach using deep features from neural networks.
Given a single RGB image, our method simultaneously recovers semantic contents, 3D geometry and object relationship.
arXiv Detail & Related papers (2020-02-22T23:27:22Z)