Self-supervised Photographic Image Layout Representation Learning
- URL: http://arxiv.org/abs/2403.03740v2
- Date: Tue, 20 Aug 2024 17:05:13 GMT
- Title: Self-supervised Photographic Image Layout Representation Learning
- Authors: Zhaoran Zhao, Peng Lu, Xujun Peng, Wenhao Guo,
- Abstract summary: We develop an autoencoder-based network architecture skilled in compressing heterogeneous layout graphs into precise, dimensionally-reduced layout representations.
We introduce the LODB dataset, which features a broader range of layout categories and richer semantics.
Our extensive experimentation on this dataset demonstrates the superior performance of our approach in the realm of photographic image layout representation learning.
- Score: 5.009120058742792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the domain of image layout representation learning, the critical process of translating image layouts into succinct vector forms is increasingly significant across diverse applications, such as image retrieval, manipulation, and generation. Most approaches in this area heavily rely on costly labeled datasets and notably lack in adapting their modeling and learning methods to the specific nuances of photographic image layouts. This shortfall makes the learning process for photographic image layouts suboptimal. In our research, we directly address these challenges. We innovate by defining basic layout primitives that encapsulate various levels of layout information and by mapping these, along with their interconnections, onto a heterogeneous graph structure. This graph is meticulously engineered to capture the intricate layout information within the pixel domain explicitly. Advancing further, we introduce novel pretext tasks coupled with customized loss functions, strategically designed for effective self-supervised learning of these layout graphs. Building on this foundation, we develop an autoencoder-based network architecture skilled in compressing these heterogeneous layout graphs into precise, dimensionally-reduced layout representations. Additionally, we introduce the LODB dataset, which features a broader range of layout categories and richer semantics, serving as a comprehensive benchmark for evaluating the effectiveness of layout representation learning methods. Our extensive experimentation on this dataset demonstrates the superior performance of our approach in the realm of photographic image layout representation learning.
Related papers
- SE-VGAE: Unsupervised Disentangled Representation Learning for Interpretable Architectural Layout Design Graph Generation [0.0]
We introduce an unsupervised disentangled representation learning framework, Style-based Edge-augmented Variational Graph Auto-Encoder.
The framework generates architectural layout in the form of attributed adjacency multi-graphs while prioritizing representation disentanglement.
We contribute a new benchmark large-scale architectural layout graph dataset extracted from real-world floor plan images.
arXiv Detail & Related papers (2024-06-25T09:40:47Z) - Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z) - Enhancing Visually-Rich Document Understanding via Layout Structure
Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - Geometry Aligned Variational Transformer for Image-conditioned Layout
Generation [38.747175229902396]
We propose an Image-Conditioned Variational Transformer (ICVT) that autoregressively generates various layouts in an image.
First, self-attention mechanism is adopted to model the contextual relationship within layout elements, while cross-attention mechanism is used to fuse the visual information of conditional images.
We construct a large-scale advertisement poster layout designing dataset with delicate layout and saliency map annotations.
arXiv Detail & Related papers (2022-09-02T07:19:12Z) - Composition-aware Graphic Layout GAN for Visual-textual Presentation
Designs [24.29890251913182]
We study the graphic layout generation problem of producing high-quality visual-textual presentation designs for given images.
We propose a deep generative model, dubbed as composition-aware graphic layout GAN (CGL-GAN), to synthesize layouts based on the global and spatial visual contents of input images.
arXiv Detail & Related papers (2022-04-30T16:42:13Z) - Semantic Layout Manipulation with High-Resolution Sparse Attention [106.59650698907953]
We tackle the problem of semantic image layout manipulation, which aims to manipulate an input image by editing its semantic label map.
A core problem of this task is how to transfer visual details from the input images to the new semantic layout while making the resulting image visually realistic.
We propose a high-resolution sparse attention module that effectively transfers visual details to new layouts at a resolution up to 512x512.
arXiv Detail & Related papers (2020-12-14T06:50:43Z) - Learning Deformable Image Registration from Optimization: Perspective,
Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep learning based framework to optimize a diffeomorphic model via multi-scale propagation.
We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
arXiv Detail & Related papers (2020-04-30T03:23:45Z) - Image Segmentation Using Deep Learning: A Survey [58.37211170954998]
Image segmentation is a key topic in image processing and computer vision.
There has been a substantial amount of works aimed at developing image segmentation approaches using deep learning models.
arXiv Detail & Related papers (2020-01-15T21:37:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.