Aggregated Structural Representation with Large Language Models for Human-Centric Layout Generation
- URL: http://arxiv.org/abs/2505.19554v1
- Date: Mon, 26 May 2025 06:17:21 GMT
- Title: Aggregated Structural Representation with Large Language Models for Human-Centric Layout Generation
- Authors: Jiongchao Jin, Shengchu Zhao, Dajun Chen, Wei Jiang, Yong Li
- Abstract summary: We propose an Aggregation Structural Representation (ASR) module that integrates graph networks with large language models (LLMs) to preserve structural information while enhancing generative capability. A comprehensive evaluation on the RICO dataset demonstrates the strong performance of ASR, both quantitatively using mean Intersection over Union (mIoU) and qualitatively through a crowdsourced user study.
- Score: 7.980497203230983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The time consumption and complexity of manual layout design make automated layout generation a critical task, especially for applications that must adapt across different mobile devices. Existing graph-based layout generation approaches suffer from limited generative capability, often producing unreasonable and incompatible outputs. Meanwhile, vision-based generative models tend to overlook the original structural information, leading to component intersections and overlaps. To address these challenges, we propose an Aggregation Structural Representation (ASR) module that integrates graph networks with large language models (LLMs) to preserve structural information while enhancing generative capability. This novel pipeline uses graph features as hierarchical prior knowledge, replacing the traditional Vision Transformer (ViT) module in multimodal large language models (MLLMs) to predict full layout information for the first time. Moreover, the intermediate graph matrix used as LLM input is human-editable, enabling progressive, human-centric design generation. A comprehensive evaluation on the RICO dataset demonstrates the strong performance of ASR, both quantitatively using mean Intersection over Union (mIoU) and qualitatively through a crowdsourced user study. Additionally, sampling on relational features ensures diverse layout generation, further enhancing the adaptability and creativity of the proposed approach.
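The paper does not include code here, so the following is a minimal, hypothetical sketch of the pipeline the abstract describes: a human-editable relation matrix is aggregated by a small graph network, node features are projected into the LLM's token-embedding space (standing in for ViT visual tokens), and predicted layouts are scored with a mean IoU. All names (`SimpleGraphEncoder`, `project_to_llm_tokens`, `layout_miou`) and design details are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the ASR idea: graph features as the LLM's prefix
# tokens in place of ViT patch embeddings. Not the authors' code.
import torch
import torch.nn as nn

class SimpleGraphEncoder(nn.Module):
    """One round of mean-aggregated message passing over an editable adjacency matrix."""
    def __init__(self, node_dim: int, hidden_dim: int):
        super().__init__()
        self.lin = nn.Linear(node_dim, hidden_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: (N, N) relation matrix a designer could edit by hand;
        # node_feats: (N, node_dim) per-component attributes.
        h = self.lin(node_feats)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(adj @ h / deg)  # aggregate neighbor messages

def project_to_llm_tokens(graph_feats: torch.Tensor, llm_dim: int) -> torch.Tensor:
    """Map graph node features to LLM-embedding-sized prefix tokens,
    standing in for the ViT visual tokens of a standard MLLM."""
    proj = nn.Linear(graph_feats.size(-1), llm_dim)  # trained jointly in practice
    return proj(graph_feats)  # (N, llm_dim), prepended to text embeddings

def layout_miou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Mean IoU over one-to-one matched boxes, each (N, 4) as (x1, y1, x2, y2)."""
    x1 = torch.maximum(pred[:, 0], gt[:, 0])
    y1 = torch.maximum(pred[:, 1], gt[:, 1])
    x2 = torch.minimum(pred[:, 2], gt[:, 2])
    y2 = torch.minimum(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter
    return (inter / union.clamp(min=1e-6)).mean().item()

# Toy usage: 5 UI components with random features and relations.
feats = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
tokens = project_to_llm_tokens(SimpleGraphEncoder(16, 32)(feats, adj), llm_dim=4096)
print(tokens.shape)  # torch.Size([5, 4096])
```

Note that `layout_miou` assumes a one-to-one matching between predicted and ground-truth components, which is a simplification of the paper's RICO evaluation; sampling the relation matrix `adj` before encoding would correspond to the diverse-generation step the abstract mentions.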
Related papers
- MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces [23.447713697204225]
MAGE is a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. We employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks.
arXiv Detail & Related papers (2025-07-29T12:17:46Z) - Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation [72.44384066166147]
Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. Existing approaches are fundamentally constrained by their reliance on a template-graph modification paradigm with a predefined set of agents and hard-coded interaction structures. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch.
arXiv Detail & Related papers (2025-07-24T09:17:41Z) - Relational Deep Learning: Challenges, Foundations and Next-Generation Architectures [50.46688111973999]
Graph machine learning has led to a significant increase in the capabilities of models that learn on arbitrary graph-structured data. We present a new blueprint that enables end-to-end representation of 'relational entity graphs' without traditional feature engineering. We discuss key challenges, including large-scale multi-table integration and the complexities of modeling temporal dynamics and heterogeneous data.
arXiv Detail & Related papers (2025-06-19T23:51:38Z) - Relation-Aware Graph Foundation Model [21.86954503656643]
Graph foundation models (GFMs) have emerged as a promising direction in graph learning. Unlike language models that rely on explicit token representations, graphs lack a well-defined unit for generalization. We propose REEF, a novel framework that leverages relation tokens as the basic units for GFMs.
arXiv Detail & Related papers (2025-05-17T14:34:41Z) - Relational Graph Transformer [44.56132732108148]
The Relational Graph Transformer (RelGT) is the first graph transformer designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids.
arXiv Detail & Related papers (2025-05-16T07:51:58Z) - Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts [3.531533402602335]
Multimodal knowledge graph completion (MMKGC) aims to predict missing links in multimodal knowledge graphs (MMKGs). Existing MMKGC approaches primarily extend traditional knowledge graph embedding (KGE) models. We propose a novel approach that integrates Transformer-based KGE models with cross-modal context generated by pre-trained VLMs.
arXiv Detail & Related papers (2025-01-26T22:23:14Z) - LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. We introduce key innovations to optimize generative performance for vision tasks. The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z) - PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation. Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts. We develop an automated text-to-poster system that generates editable posters based on users' design intentions.
arXiv Detail & Related papers (2024-06-05T03:05:52Z) - UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into the transformer, enhancing discriminative representation learning.
The two proposed modules are lightweight, can be plugged into any transformer network, and are easily trained end-to-end.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.