ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
- URL: http://arxiv.org/abs/2305.07490v6
- Date: Thu, 4 Apr 2024 18:55:18 GMT
- Title: ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
- Authors: Zhengqing Yuan, Yunhong He, Kun Wang, Yanfang Ye, Lichao Sun
- Abstract summary: ArtGPT-4 is a large vision-language model tailored to address the limitations of existing models in artistic comprehension.
It can render images with artistic understanding and convey the emotions they inspire, mirroring human interpretation.
- Score: 19.830089364830066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of large language models (LLMs) has inspired an emerging research field of multimodal learning. However, a grand challenge of exploiting LLMs for multimodal learning is the size of pre-trained LLMs, which typically have billions of parameters. To tackle this challenge, models such as MiniGPT-4 and LLaVA have been developed to fine-tune the pre-trained models using fewer parameters. Despite their promising performance, these models remain limited in their understanding of artistic imagery. To facilitate better artistic understanding, in this paper we propose ArtGPT-4, a pioneering large vision-language model tailored to address the limitations of existing models in artistic comprehension. The key innovation of ArtGPT-4 lies in its design for the sophisticated challenge of artistic image comprehension, setting it apart from other models that overlook fine details in favor of broader themes. Specifically, it integrates specialized adapter layers into the LLM, enabling the model to parse and interpret complex visual tokens more efficiently and effectively, instead of fine-tuning the whole LLM as in existing methods. ArtGPT-4 has demonstrated outstanding efficiency: on a Tesla A100 device, its training can be completed in a mere 2 hours with an image-text pair dataset comprising approximately 0.52M entries. Additionally, ArtGPT-4 has achieved state-of-the-art performance on the ArtEmis and ArtEmis-v2.0 datasets as well as the benchmarks established in this work, lagging behind professional artists' descriptions by a negligible 0.15 points on a 6-point scale. This performance shows that ArtGPT-4 can render images with artistic understanding and convey the emotions they inspire, mirroring human interpretation. The code and the pre-trained model are accessible at \url{https://github.com/DLYuanGod/ArtGPT-4}.
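The abstract's central idea is to insert small trainable adapter layers into an otherwise frozen LLM rather than fine-tune all of its weights. A minimal PyTorch sketch of that general bottleneck-adapter pattern is given below; the module names, bottleneck width, and placement are illustrative assumptions, not ArtGPT-4's actual implementation.

```python
# Minimal sketch of the bottleneck-adapter idea: tiny trainable layers are
# attached to frozen transformer blocks, so only the adapters receive gradients.
# Dimensions and placement are illustrative, not ArtGPT-4's released code.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Down-project, non-linearity, up-project, residual add."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class FrozenBlockWithAdapter(nn.Module):
    """Wraps a frozen transformer block and appends a trainable adapter."""

    def __init__(self, block: nn.Module, hidden_size: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # the original LLM weights stay fixed
        self.adapter = Adapter(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))


# Toy usage: wrap every layer of a small stand-in "LLM" and train only the adapters.
hidden = 512
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True) for _ in range(4)]
)
adapted = nn.ModuleList([FrozenBlockWithAdapter(b, hidden) for b in layers])
trainable = [p for m in adapted for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Because the frozen block's output passes through a near-identity residual adapter, training starts from the base model's behavior and only a small fraction of parameters is ever updated, which is what makes the short training time quoted in the abstract plausible.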
Related papers
- Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art [61.28133495240179]
We propose a novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output.
Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ.
We demonstrate that T2I diffusion models can effectively offer 10 compositional controls through user-specified PoA conditions.
arXiv Detail & Related papers (2025-03-15T06:58:09Z) - CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements [1.0579965347526206]
Art, as a universal language, can be interpreted in diverse ways.
Large Language Models (LLMs) and the availability of Multimodal Large Language Models (MLLMs) raise the question of how these models can be used to assess and interpret artworks.
arXiv Detail & Related papers (2025-02-04T18:08:23Z) - Visual Perception in Text Strings [24.60102607739684]
In this work, we select ASCII art as a representative artifact, where the lines and brightness used to depict each concept are rendered by characters.
We benchmark model performance on this task by constructing an evaluation dataset and also collect a training set to elicit the models' visual perception ability.
Results reveal that although humans can achieve nearly 100% accuracy, the state-of-the-art LLMs and MLLMs lag far behind.
arXiv Detail & Related papers (2024-10-02T16:46:01Z) - GalleryGPT: Analyzing Paintings with Large Multimodal Models [64.98398357569765]
Artwork analysis is an important and fundamental skill for art appreciation, which can enrich personal aesthetic sensibility and facilitate critical thinking.
Previous work on automatically analyzing artworks mainly focuses on classification, retrieval, and other simple tasks, which falls far short of this goal.
We introduce GalleryGPT, a large multimodal model for composing painting analyses, slightly modified and fine-tuned from the LLaVA architecture.
arXiv Detail & Related papers (2024-08-01T11:52:56Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models [51.98253148764755]
We introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, for enhancing LVLMs' scientific comprehension.
ArXivCap is a figure-caption dataset comprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers spanning various scientific domains.
ArXivQA is a question-answering dataset generated by prompting GPT-4V based on scientific figures.
arXiv Detail & Related papers (2024-03-01T02:21:30Z) - Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, producing discrepant outputs when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z) - Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage [28.301944852273746]
This paper introduces a novel approach to address the challenges of limited annotated data and domain shifts in the cultural heritage domain.
By leveraging generative vision-language models, we augment art datasets by generating diverse variations of artworks conditioned on their captions.
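As a rough illustration of that caption-conditioned augmentation idea, the sketch below uses an off-the-shelf image-to-image diffusion pipeline from the Hugging Face diffusers library; the library choice, checkpoint name, and strength values are assumptions for illustration, not the paper's reported configuration.

```python
# Illustrative augmentation loop: produce caption-conditioned variations of an
# artwork with an image-to-image diffusion pipeline. The library, checkpoint,
# and hyperparameters are assumptions, not the paper's exact setup.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

artwork = Image.open("artwork.jpg").convert("RGB").resize((512, 512))
caption = "an impressionist landscape at dusk, oil on canvas"

# Lower strength keeps the original composition while varying stylistic details.
variations = [
    pipe(prompt=caption, image=artwork, strength=s, guidance_scale=7.5).images[0]
    for s in (0.3, 0.5, 0.7)
]
for i, img in enumerate(variations):
    img.save(f"artwork_aug_{i}.png")
```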
arXiv Detail & Related papers (2023-08-14T13:59:04Z) - GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models such as CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
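A minimal sketch of this paradigm follows, assuming a generic frozen text encoder whose caption embeddings supervise a vision backbone through a simple cosine-alignment loss; the backbone, projection, and loss are illustrative choices, not the paper's exact recipe.

```python
# Illustrative supervision signal: image features from a vision backbone are
# pulled toward text embeddings of the corresponding image descriptions.
# The frozen text encoder is left abstract here (assumption, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class AlignedVisionModel(nn.Module):
    def __init__(self, text_dim: int = 768):
        super().__init__()
        self.backbone = torchvision.models.resnet18(weights=None)
        self.backbone.fc = nn.Identity()       # expose the 512-d pooled features
        self.proj = nn.Linear(512, text_dim)   # map image features into the text space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(images))


def alignment_loss(img_feat: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Cosine loss pulling each image feature toward its caption embedding."""
    return 1.0 - F.cosine_similarity(img_feat, txt_emb, dim=-1).mean()


model = AlignedVisionModel()
images = torch.randn(4, 3, 224, 224)   # dummy image batch
text_embeddings = torch.randn(4, 768)  # would come from a frozen pre-trained text encoder
loss = alignment_loss(model(images), text_embeddings)
loss.backward()
```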
arXiv Detail & Related papers (2023-06-01T14:02:45Z) - MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [41.84885546518666]
GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text.
We present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced large language model.
We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images.
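The alignment recipe summarized here, a frozen visual encoder bridged to a frozen LLM, can be sketched as a single trainable projection between the two. The dummy encoder and dimensions below are illustrative assumptions rather than MiniGPT-4's actual components.

```python
# Sketch: a frozen ViT-style encoder yields patch tokens, and only a linear
# projection into the frozen LLM's embedding space is trained.
import torch
import torch.nn as nn


class DummyVisionEncoder(nn.Module):
    """Stand-in for a frozen pre-trained visual encoder: conv patch embedding only."""

    def __init__(self, vis_dim: int = 1024, patch: int = 16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, vis_dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(images)         # (B, vis_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, N, vis_dim) patch tokens


class VisionToLLMBridge(nn.Module):
    """Trainable projection aligning frozen visual tokens with a frozen LLM's input space."""

    def __init__(self, vision_encoder: nn.Module, vis_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False              # the visual encoder stays frozen
        self.proj = nn.Linear(vis_dim, llm_dim)  # the only trainable component

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            tokens = self.vision_encoder(images)  # (B, N, vis_dim)
        return self.proj(tokens)                  # (B, N, llm_dim), fed to the frozen LLM as soft tokens


bridge = VisionToLLMBridge(DummyVisionEncoder(), vis_dim=1024, llm_dim=4096)
visual_prompt = bridge(torch.randn(2, 3, 224, 224))
```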
arXiv Detail & Related papers (2023-04-20T18:25:35Z) - Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z) - Towards mapping the contemporary art world with ArtLM: an art-specific NLP model [0.0]
We present a generic Natural Language Processing framework (called ArtLM) to discover the connections among contemporary artists based on their biographies.
With extensive experiments, we demonstrate that our ArtLM achieves 85.6% accuracy and 84.0% F1 score.
We also provide a visualisation and a qualitative analysis of the artist network built from ArtLM's outputs.
arXiv Detail & Related papers (2022-12-14T09:26:07Z) - Art Style Classification with Self-Trained Ensemble of AutoEncoding Transformations [5.835728107167379]
The artistic style of a painting is a rich descriptor that reveals both visual and deep intrinsic knowledge about how an artist uniquely portrays and expresses their creative vision.
In this paper, we investigate the use of deep self-supervised learning methods to solve the problem of recognizing complex artistic styles with high intra-class and low inter-class variation.
arXiv Detail & Related papers (2020-12-06T21:05:23Z)