ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding
- URL: http://arxiv.org/abs/2505.06020v1
- Date: Fri, 09 May 2025 13:08:27 GMT
- Title: ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding
- Authors: Shuai Wang, Ivona Najdenkoska, Hongyi Zhu, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
- Abstract summary: ArtRAG is a novel framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. At inference time, a structured retriever selects semantically and topologically relevant subgraphs to guide generation. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines.
- Score: 16.9945713458689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding visual art requires reasoning across multiple perspectives -- cultural, historical, and stylistic -- beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.
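The pipeline outlined in the abstract (graph construction, then semantically and topologically scored subgraph retrieval, then grounded generation) can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' implementation: a tiny hand-written graph stands in for the automatically constructed ACKG, a bag-of-words vector stands in for a learned text encoder, and the entities, relations, and scoring weights are invented for demonstration.

```python
# Illustrative ArtRAG-style sketch (not the authors' implementation).
# A toy hand-written graph stands in for the Art Context Knowledge Graph (ACKG).
import networkx as nx
import numpy as np

EDGES = [
    ("Vincent van Gogh", "painted", "The Starry Night"),
    ("Vincent van Gogh", "associated_with", "Post-Impressionism"),
    ("Post-Impressionism", "reacted_against", "Impressionism"),
    ("The Starry Night", "depicts", "night sky over Saint-Remy"),
    ("Vincent van Gogh", "influenced_by", "Japanese prints"),
]
G = nx.Graph()
for head, rel, tail in EDGES:
    G.add_edge(head, tail, relation=rel)

def embed(text, vocab):
    """Toy bag-of-words vector; a real system would use a learned encoder."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def retrieve_subgraph(graph, query, anchor, k=3, alpha=0.7):
    """Score nodes by semantic similarity to the query plus topological
    closeness to an anchor node (e.g. the recognized artist), then return
    the induced subgraph over the top-k nodes and the anchor."""
    vocab = sorted(set(query.lower().split())
                   | {w for n in graph.nodes for w in n.lower().split()})
    q_vec = embed(query, vocab)
    dists = nx.single_source_shortest_path_length(graph, anchor)
    scores = {}
    for node in graph.nodes:
        semantic = cosine(embed(node, vocab), q_vec)
        topological = 1.0 / (1.0 + dists.get(node, len(graph)))
        scores[node] = alpha * semantic + (1 - alpha) * topological
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return graph.subgraph(set(top) | {anchor})

def subgraph_to_context(subgraph):
    """Linearize the retrieved triples into plain text for the MLLM prompt."""
    return "\n".join(f"{u} --{d['relation']}--> {v}"
                     for u, v, d in subgraph.edges(data=True))

query = "Describe the style and historical context of this painting of a night sky."
sub = retrieve_subgraph(G, query, anchor="Vincent van Gogh")
prompt = (f"Context:\n{subgraph_to_context(sub)}\n\n"
          f"Question: {query}\nAnswer using the context above.")
print(prompt)  # This text, together with the image, would be passed to an MLLM.
```

In a full system the anchor entity would come from recognizing the artwork or artist in the image, the retriever would operate at multiple granularities over artists, movements, themes, and historical events, and the serialized context would be fed to the MLLM alongside the painting.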
Related papers
- ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval [8.94249680213101]
ArtSeek is a framework for art analysis that combines multimodal large language models with retrieval-augmented generation. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia.
arXiv Detail & Related papers (2025-07-29T15:31:58Z) - Context-aware Multimodal AI Reveals Hidden Pathways in Five Centuries of Art Evolution [1.8435193934665342]
We use cutting-edge generative AI, specifically Stable Diffusion, to analyze 500 years of Western paintings. Our findings reveal that contextual information differentiates between artistic periods, styles, and individual artists more successfully than formal elements. Our generative experiment, infusing prospective contexts into historical artworks, successfully reproduces the evolutionary trajectory of artworks.
arXiv Detail & Related papers (2025-03-15T10:45:04Z) - Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art [61.28133495240179]
We propose a novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ. We demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions.
arXiv Detail & Related papers (2025-03-15T06:58:09Z) - CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements [1.0579965347526206]
Art, as a universal language, can be interpreted in diverse ways. Large Language Models (LLMs) and the availability of Multimodal Large Language Models (MLLMs) raise the question of how these models can be used to assess and interpret artworks.
arXiv Detail & Related papers (2025-02-04T18:08:23Z) - VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models [53.59400446543756]
Artistic typography is a technique to visualize the meaning of an input character in an imaginative and readable manner. We introduce a dual-branch, training-free method called VitaGlyph, enabling flexible artistic typography with controllable geometry changes.
arXiv Detail & Related papers (2024-10-02T16:48:47Z) - KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph [24.586916324061168]
We present KALE, a Knowledge-Augmented vision-Language model for artwork Elaborations.
KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph.
Experimental results demonstrate that KALE outperforms existing state-of-the-art work across several artwork datasets.
arXiv Detail & Related papers (2024-09-17T06:39:18Z) - Diffusion-Based Visual Art Creation: A Survey and New Perspectives [51.522935314070416]
This survey explores the emerging realm of diffusion-based visual art creation, examining its development from both artistic and technical perspectives.
Our findings reveal how artistic requirements are transformed into technical challenges and highlight the design and application of diffusion-based methods within visual art creation.
We aim to shed light on the mechanisms through which AI systems emulate and possibly, enhance human capacities in artistic perception and creativity.
arXiv Detail & Related papers (2024-08-22T04:49:50Z) - GalleryGPT: Analyzing Paintings with Large Multimodal Models [64.98398357569765]
Artwork analysis is an important and fundamental skill for art appreciation, which can enrich personal aesthetic sensibility and foster critical thinking.
Previous work on automatically analyzing artworks mainly focuses on classification, retrieval, and other simple tasks, which falls far short of comprehensive art analysis.
We introduce a large multimodal model for composing painting analyses, dubbed GalleryGPT, which is slightly modified and fine-tuned from the LLaVA architecture.
arXiv Detail & Related papers (2024-08-01T11:52:56Z) - Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding [28.490495656348187]
We offer the Pun Rebus Art dataset for art understanding deeply rooted in traditional Chinese culture.
We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explaining the conveyed messages.
Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations.
arXiv Detail & Related papers (2024-06-14T16:52:00Z) - Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage [28.301944852273746]
This paper introduces a novel approach to address the challenges of limited annotated data and domain shifts in the cultural heritage domain.
By leveraging generative vision-language models, we augment art datasets by generating diverse variations of artworks conditioned on their captions.
arXiv Detail & Related papers (2023-08-14T13:59:04Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models [63.545146807810305]
Text-to-image diffusion models can generate high-quality pictures from textual input prompts.
These models have been trained using text data collected from content-based labelling protocols.
We characterise the sentimentality, objectiveness and degree of abstraction of publicly available text data used to train current text-to-image diffusion models.
arXiv Detail & Related papers (2022-10-19T14:20:05Z)