Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art
- URL: http://arxiv.org/abs/2503.12018v1
- Date: Sat, 15 Mar 2025 06:58:09 GMT
- Title: Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art
- Authors: Zhe Jin, Tat-Seng Chua
- Abstract summary: We propose a novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ. We demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions.
- Score: 61.28133495240179
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-Image (T2I) diffusion models (DMs) have seen widespread adoption due to their ability to generate high-fidelity outputs and their accessibility to anyone able to put imagination into words. However, DMs are often predisposed to generate unappealing outputs, much like the random internet images on which they were trained. Existing approaches to address this rest on the implicit premise that visual aesthetics are universal, which is limiting. Aesthetics in the T2I context should be about personalization, so we propose the novel task of aesthetics alignment, which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective on aesthetics, we codify visual aesthetics using the compositional framework artists employ, known as the Principles of Art (PoA). To facilitate this study, we introduce CompArt, a large-scale compositional art dataset built on top of WikiArt, with PoA analyses annotated by a capable multimodal LLM. Leveraging the expressive power of LLMs and training a lightweight, transferable adapter, we demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions. Additionally, we design an appropriate evaluation framework to assess the efficacy of our approach.
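The abstract describes the mechanism only at a high level: LLM embeddings of ten user-specified PoA descriptions condition a frozen T2I diffusion model through a lightweight, transferable adapter. The paper's implementation is not reproduced here, so the following is a minimal sketch of one plausible reading, in which the adapter projects per-principle embeddings into extra context tokens for the UNet's cross-attention. The class name, dimensions, and fusion strategy are all illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of adapter-based PoA conditioning.
import torch
import torch.nn as nn

POA_AXES = 10   # ten compositional controls, per the abstract
TEXT_DIM = 768  # prompt-encoder hidden size (assumed)
POA_DIM = 768   # PoA-condition embedding size (assumed)

class PoAAdapter(nn.Module):
    """Projects per-principle condition embeddings into extra context tokens
    concatenated with the ordinary prompt embeddings, so a frozen UNet's
    cross-attention can attend to them. Only this adapter would be trained."""
    def __init__(self, poa_dim: int = POA_DIM, text_dim: int = TEXT_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(poa_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, prompt_emb: torch.Tensor, poa_emb: torch.Tensor) -> torch.Tensor:
        # prompt_emb: (B, L, TEXT_DIM) from the frozen text encoder
        # poa_emb:    (B, POA_AXES, POA_DIM), e.g. LLM embeddings of the
        #             user's per-principle descriptions
        poa_tokens = self.proj(poa_emb)                     # (B, POA_AXES, TEXT_DIM)
        return torch.cat([prompt_emb, poa_tokens], dim=1)   # extended context

adapter = PoAAdapter()
prompt_emb = torch.randn(2, 77, TEXT_DIM)
poa_emb = torch.randn(2, POA_AXES, POA_DIM)
context = adapter(prompt_emb, poa_emb)  # (2, 87, 768), fed to cross-attention
```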
Related papers
- POSTA: A Go-to Framework for Customized Artistic Poster Generation [87.16343612086959]
POSTA is a modular framework for customized artistic poster generation.
Background Diffusion creates a themed background based on user input.
Design MLLM then generates layout and typography elements that align with and complement the background style.
ArtText Diffusion applies additional stylization to key text elements.
arXiv Detail & Related papers (2025-03-19T05:22:38Z)
- IC-Portrait: In-Context Matching for View-Consistent Personalized Portrait [51.18967854258571]
IC-Portrait is a novel framework designed to accurately encode individual identities for personalized portrait generation.
Our key insight is that pre-trained diffusion models are fast learners for in-context dense correspondence matching.
We show that IC-Portrait consistently outperforms existing state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2025-01-28T18:59:03Z)
- Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning [14.405750888492735]
Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets. We propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight.
arXiv Detail & Related papers (2024-12-16T16:35:35Z)
- APDDv2: Aesthetics of Paintings and Drawings Dataset with Artist Labeled Scores and Comments [45.57709215036539]
We introduce the Aesthetics of Paintings and Drawings Dataset (APDD), the first comprehensive collection of paintings encompassing 24 distinct artistic categories and 10 aesthetic attributes.
APDDv2 boasts an expanded image corpus and improved annotation quality, featuring detailed language comments.
We present an updated version of the Art Assessment Network for Specific Painting Styles, denoted ArtCLIP. Experimental validation demonstrates that this revised model surpasses its predecessor in aesthetic-evaluation accuracy and efficacy.
arXiv Detail & Related papers (2024-11-13T11:46:42Z)
- DiffArtist: Towards Structure and Appearance Controllable Image Stylization [19.5597806965592]
We present a comprehensive study on the simultaneous stylization of structure and appearance of 2D images.
We introduce DiffArtist, which is the first stylization method to allow for dual controllability over structure and appearance.
arXiv Detail & Related papers (2024-07-22T17:58:05Z)
- AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception [74.11069437400398]
We develop a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K items of human natural-language feedback.
We fine-tune open-source general foundation models to obtain multi-modality Aesthetic Expert models, dubbed AesExpert.
Experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performances than the state-of-the-art MLLMs.
arXiv Detail & Related papers (2024-04-15T09:56:20Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) into the pre-trained diffusion model to extract features suited to perception; a hedged sketch of this idea appears after this list.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining [53.470662123170555]
We propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model on image-comment pairs, using contrastive and generative objectives to learn rich, generic aesthetic semantics without human labels; a sketch of such a combined objective appears after this list.
Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset.
arXiv Detail & Related papers (2023-03-24T23:57:28Z)
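As referenced in the meta-prompts entry above, here is a minimal sketch of learnable prompt embeddings querying a frozen diffusion backbone's features. This is not the authors' code: the module name, prompt count, dimensions, and attention-based readout are all assumptions.

```python
import torch
import torch.nn as nn

class MetaPromptReadout(nn.Module):
    """Hypothetical reconstruction: N learnable 'meta prompt' vectors attend
    over the frozen backbone's spatial feature map, producing task-specific
    features for a downstream perception head (depth, segmentation, ...)."""
    def __init__(self, num_prompts: int = 16, dim: int = 512):
        super().__init__()
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H*W, dim) flattened features from the frozen diffusion
        # UNet; only the prompts and this attention layer would be trained.
        b = feats.size(0)
        queries = self.meta_prompts.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(queries, feats, feats)  # prompts attend to features
        return out                                 # (B, num_prompts, dim)

readout = MetaPromptReadout()
feats = torch.randn(2, 32 * 32, 512)   # stand-in backbone features
task_feats = readout(feats)            # would feed a downstream perception head
```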
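Likewise, for the VILA entry, a minimal sketch of a combined contrastive-plus-generative pretraining objective over image-comment pairs. The function name, weighting, and tensor shapes are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_generative_loss(img_emb, txt_emb, caption_logits,
                                caption_targets, temperature=0.07,
                                gen_weight=1.0):
    """Illustrative combined objective (not VILA's actual code).

    img_emb, txt_emb: (B, D) pooled embeddings of images and user comments.
    caption_logits:   (B, T, V) decoder logits for reconstructing the comment.
    caption_targets:  (B, T) comment token ids, with -100 marking padding."""
    # CLIP-style symmetric InfoNCE over the in-batch image/comment pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    contrastive = 0.5 * (F.cross_entropy(sims, targets)
                         + F.cross_entropy(sims.t(), targets))
    # Generative objective: token-level prediction of the raw comment text.
    generative = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten(), ignore_index=-100)
    return contrastive + gen_weight * generative
```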