Textual Aesthetics in Large Language Models
- URL: http://arxiv.org/abs/2411.02930v1
- Date: Tue, 05 Nov 2024 09:22:08 GMT
- Title: Textual Aesthetics in Large Language Models
- Authors: Lingjie Jiang, Shaohan Huang, Xun Wu, Furu Wei
- Abstract summary: We introduce a pipeline for aesthetics polishing and help construct a textual aesthetics dataset named TexAes.
We propose a textual aesthetics-powered fine-tuning method based on direct preference optimization, termed TAPO.
Our experiments demonstrate that using textual aesthetics data and employing the TAPO fine-tuning method not only improves aesthetic scores but also enhances performance on general evaluation datasets.
- Score: 80.09790024030525
- License:
- Abstract: Image aesthetics is a crucial metric in the field of image generation. However, textual aesthetics has not been sufficiently explored. With the widespread application of large language models (LLMs), previous work has primarily focused on the correctness of content and the helpfulness of responses. Nonetheless, providing responses with textual aesthetics is also an important factor for LLMs, which can offer a cleaner layout and ensure greater consistency and coherence in content. In this work, we introduce a pipeline for aesthetics polishing and help construct a textual aesthetics dataset named TexAes. We propose a textual aesthetics-powered fine-tuning method based on direct preference optimization, termed TAPO, which leverages textual aesthetics without compromising content correctness. Additionally, we develop two evaluation methods for textual aesthetics based on text and image analysis, respectively. Our experiments demonstrate that using textual aesthetics data and employing the TAPO fine-tuning method not only improves aesthetic scores but also enhances performance on general evaluation datasets such as AlpacaEval and Arena-Hard.
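The abstract does not include an implementation, but since TAPO is described as building on direct preference optimization, a minimal sketch of a DPO-style loss over aesthetics preference pairs is given below; the function name, `beta` value, and dummy log-probabilities are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss.

    Here the "chosen" response would be the aesthetically polished answer and
    the "rejected" response the original one; beta controls how far the policy
    may drift from the reference model. (Naming and setup are illustrative,
    not taken from the paper.)
    """
    # Log-ratio of policy vs. reference for each response in the pair.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Standard Bradley-Terry / DPO objective: prefer the polished response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Usage with dummy per-sequence log-probabilities:
loss = dpo_style_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                      torch.tensor([-12.9]), torch.tensor([-15.0]))
```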
Related papers
- Learning to Customize Text-to-Image Diffusion In Diverse Context [23.239646132590043]
Most text-to-image customization techniques fine-tune models on a small set of personal concept images captured in minimal contexts.
We resort to diversifying the context of these personal concepts by simply creating a contextually rich set of text prompts.
Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space.
Our approach does not require any architectural modifications, making it highly compatible with existing text-to-image customization methods.
arXiv Detail & Related papers (2024-10-14T00:53:59Z)
- Intelligent Artistic Typography: A Comprehensive Review of Artistic Text Design and Generation [15.367944842667146]
Artistic text generation aims to amplify the aesthetic qualities of text while maintaining readability.
Artistic text stylization concentrates on the text effect overlaid upon the text, such as shadows, outlines, colors, glows, and textures.
Semantic typography focuses on the deformation of the characters to strengthen their visual representation by mimicking the semantic understanding within the text.
arXiv Detail & Related papers (2024-07-20T06:45:09Z)
- Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace [52.24866347353916]
We propose an efficient method to explore the target embedding in a textual subspace.
We also propose an efficient selection strategy for determining the basis of the textual subspace.
Our method opens the door to more efficient representation learning for personalized text-to-image generation.
arXiv Detail & Related papers (2024-06-30T06:41:21Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
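As a rough illustration of the second stage described above, the sketch below fuses embeddings of the VLLM-generated description with visual features in a small transformer classifier; the dimensions, module choices, and class count are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class EmotionFusionClassifier(nn.Module):
    """Stage 2 (illustrative): fuse the VLLM-generated emotion description
    with visual features and classify the emotion."""

    def __init__(self, text_dim=768, image_dim=1024, hidden=512, num_emotions=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(hidden, num_emotions)

    def forward(self, text_tokens, image_tokens):
        # text_tokens: (B, T, text_dim) embeddings of the generated description
        # image_tokens: (B, I, image_dim) visual features of the input image
        x = torch.cat([self.text_proj(text_tokens),
                       self.image_proj(image_tokens)], dim=1)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # pooled tokens -> emotion logits

logits = EmotionFusionClassifier()(torch.randn(2, 16, 768), torch.randn(2, 49, 1024))
```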
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
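A minimal sketch of the alignment idea, assuming a simple cosine objective in CLIP feature space; the paper's actual optimization of pseudo features may use a different formulation.

```python
import torch
import torch.nn.functional as F

def clip_feature_alignment_loss(generated_feats, target_feats):
    """Illustrative alignment objective: push CLIP features of synthetic
    images toward the features they should mimic (e.g., estimated real-image
    features). 1 - cosine similarity is one simple choice."""
    generated_feats = F.normalize(generated_feats, dim=-1)
    target_feats = F.normalize(target_feats, dim=-1)
    return (1.0 - (generated_feats * target_feats).sum(dim=-1)).mean()

loss = clip_feature_alignment_loss(torch.randn(4, 512), torch.randn(4, 512))
```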
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- Image Aesthetics Assessment via Learnable Queries [59.313054821874864]
We propose the Image Aesthetics Assessment via Learnable Queries (IAA-LQ) approach.
It adapts learnable queries to extract aesthetic features from pre-trained image features obtained from a frozen image encoder.
Experiments on real-world data demonstrate the advantages of IAA-LQ, beating the best state-of-the-art method by 2.2% and 2.1% in terms of SRCC and PLCC, respectively.
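The learnable-query design can be read as cross-attention from a few trainable query vectors into frozen image-encoder tokens; the sketch below is a hedged reconstruction with assumed sizes, not the released IAA-LQ code.

```python
import torch
import torch.nn as nn

class LearnableQueryAestheticHead(nn.Module):
    """Illustrative head: a small set of learnable queries cross-attends into
    frozen image-encoder tokens and regresses an aesthetic score."""

    def __init__(self, feat_dim=768, num_queries=8, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frozen_image_tokens):
        # frozen_image_tokens: (B, N, feat_dim) from a frozen image encoder
        q = self.queries.unsqueeze(0).expand(frozen_image_tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(q, frozen_image_tokens, frozen_image_tokens)
        return self.score(attended.mean(dim=1)).squeeze(-1)  # one score per image

scores = LearnableQueryAestheticHead()(torch.randn(2, 196, 768))
```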
arXiv Detail & Related papers (2023-09-06T09:42:16Z)
- VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining [53.470662123170555]
We propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels.
Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset.
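A sketch of how a contrastive objective over image-comment pairs might be combined with a generative captioning term, as the summary describes; the loss form, temperature, and weighting are assumptions rather than VILA's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over image-comment pairs (illustrative)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def pretraining_loss(image_emb, text_emb, caption_logits, caption_targets, alpha=1.0):
    # Combine the contrastive term with a generative term: next-token
    # cross-entropy over the user comment tokens.
    gen = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    return contrastive_loss(image_emb, text_emb) + alpha * gen

# Dummy usage with assumed shapes: (B, D) embeddings, (B, T, V) caption logits.
img, txt = torch.randn(4, 512), torch.randn(4, 512)
cap_logits, cap_targets = torch.randn(4, 12, 30522), torch.randint(0, 30522, (4, 12))
loss = pretraining_loss(img, txt, cap_logits, cap_targets)
```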
arXiv Detail & Related papers (2023-03-24T23:57:28Z)
- Holistic Visual-Textual Sentiment Analysis with Prior Models [64.48229009396186]
We propose a holistic method that achieves robust visual-textual sentiment analysis.
The proposed method consists of four parts: (1) a visual-textual branch to learn features directly from data for sentiment analysis, (2) a visual expert branch with a set of pre-trained "expert" encoders to extract selected semantic visual features, (3) a CLIP branch to implicitly model visual-textual correspondence, and (4) a multimodal feature fusion network based on BERT to fuse multimodal features and make sentiment predictions.
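The sketch below illustrates how features from the three feature-producing branches could be fused by a BERT-style network into sentiment predictions; the dimensions and the plain transformer stand-in for BERT are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HolisticSentimentFusion(nn.Module):
    """Illustrative fusion: project features from the visual-textual branch,
    the visual expert branch, and the CLIP branch into a shared space, fuse
    them with a small transformer (stand-in for the BERT-based fusion
    network), and predict sentiment logits."""

    def __init__(self, dims=(768, 512, 512), hidden=768, num_classes=3):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, branch_feats):
        # branch_feats: list of (B, dim_i) feature vectors, one per branch
        tokens = torch.stack([p(f) for p, f in zip(self.projs, branch_feats)], dim=1)
        fused = self.fusion(tokens).mean(dim=1)
        return self.classifier(fused)

logits = HolisticSentimentFusion()(
    [torch.randn(2, 768), torch.randn(2, 512), torch.randn(2, 512)])
```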
arXiv Detail & Related papers (2022-11-23T14:40:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.