DesignProbe: A Graphic Design Benchmark for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2404.14801v1
- Date: Tue, 23 Apr 2024 07:31:19 GMT
- Title: DesignProbe: A Graphic Design Benchmark for Multimodal Large Language Models
- Authors: Jieru Lin, Danqing Huang, Tiejun Zhao, Dechen Zhan, Chin-Yew Lin
- Abstract summary: A well-executed graphic design typically achieves harmony at two levels, from the fine-grained design elements (color, font and layout) to the overall design.
With the rapid development of Multimodal Large Language Models (MLLMs), we establish DesignProbe, a benchmark to investigate the capability of MLLMs in design.
- Score: 35.10231741092462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A well-executed graphic design typically achieves harmony at two levels, from the fine-grained design elements (color, font and layout) to the overall design. This complexity makes the comprehension of graphic design challenging, as it requires the capability both to recognize the design elements and to understand the design. With the rapid development of Multimodal Large Language Models (MLLMs), we establish DesignProbe, a benchmark to investigate the capability of MLLMs in design. Our benchmark includes eight tasks in total, across both the fine-grained element level and the overall design level. At the design element level, we consider both attribute recognition and semantic understanding tasks. At the overall design level, we include style and metaphor. Nine MLLMs are tested, and we apply GPT-4 as the evaluator. Further experiments indicate that refining prompts can enhance the performance of MLLMs. We first rewrite the prompts with different LLMs and find that performance increases when a prompt is self-refined by the MLLM's own LLM. We then add extra task knowledge in two different ways (text descriptions and image examples), finding that adding image examples boosts performance much more than adding text descriptions.
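As an illustration of the evaluation protocol sketched in the abstract (prompt self-refinement followed by GPT-4-as-evaluator scoring), a minimal Python sketch is given below. It is based only on the abstract, not on DesignProbe's released code: the prompt wordings, the 1-5 rubric, the chat helper and the model names are illustrative assumptions, and image inputs are omitted for brevity.

```python
# Minimal sketch (not the paper's released code) of two ideas from the abstract:
# (1) letting a model rewrite ("self-refine") its own task prompt, and
# (2) using GPT-4 as the evaluator that scores another model's answer.
# Prompt wordings and the 1-5 rubric are placeholders; image inputs are omitted.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def chat(model: str, prompt: str) -> str:
    """Single-turn helper around the Chat Completions API."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


def self_refine_prompt(model: str, task_prompt: str) -> str:
    """Ask the same model that will answer the task to rewrite its own prompt."""
    return chat(
        model,
        "Rewrite the following graphic-design question so that it is clearer "
        "and easier for you to answer. Return only the rewritten question.\n\n"
        + task_prompt,
    )


def gpt4_evaluate(question: str, reference: str, answer: str) -> str:
    """Use GPT-4 as the judge; returns a 1-5 score under a placeholder rubric."""
    return chat(
        "gpt-4",
        "You are grading an answer to a graphic-design question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with a single integer score from 1 (wrong) to 5 (fully correct).",
    )


if __name__ == "__main__":
    original = "What is the dominant color palette of this poster, and why does it work?"
    refined = self_refine_prompt("gpt-4", original)           # step 1: self-refinement
    candidate = "Warm oranges and reds; they convey energy."  # stand-in for an MLLM's answer
    print(gpt4_evaluate(refined, "A warm analogous palette of oranges and reds.", candidate))
```

The same two helpers extend naturally to the paper's second experiment: extra task knowledge can be prepended to the prompt either as a text description or, with a vision-capable model, as an in-context image example.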
Related papers
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We conduct extensive experiments and achieve state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks.
arXiv Detail & Related papers (2024-06-05T03:05:52Z)
- AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception [64.25808552299905]
AesBench is an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.
We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts.
We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA) and Interpretation (AesI).
arXiv Detail & Related papers (2024-01-16T10:58:07Z)
- COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design [39.809852329070466]
This paper introduces the COLE system - a hierarchical generation framework designed to address these challenges.
This COLE system can transform a vague intention prompt into a high-quality multi-layered graphic design, while also supporting flexible editing based on user input.
arXiv Detail & Related papers (2023-11-28T17:22:17Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
- What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning [115.19451843294154]
Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs).
We propose a systematic approach to automatically creating high-quality complex visual reasoning instructions.
Our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively.
arXiv Detail & Related papers (2023-11-02T15:36:12Z)
- How Can Large Language Models Help Humans in Design and Manufacturing? [28.28959612862582]
Large Language Models (LLMs), including GPT-4, provide exciting new opportunities for generative design.
We scrutinize the utility of LLMs in tasks such as: converting a text-based prompt into a design specification, transforming a design into manufacturing instructions, producing a design space and design variations, computing the performance of a design, and searching for designs predicated on performance.
By exposing these limitations, we aspire to catalyze the continued improvement and progression of these models.
arXiv Detail & Related papers (2023-07-25T17:30:38Z)