ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval
- URL: http://arxiv.org/abs/2507.21917v1
- Date: Tue, 29 Jul 2025 15:31:58 GMT
- Title: ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval
- Authors: Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
- Abstract summary: ArtSeek is a framework for art analysis that combines multimodal large language models with retrieval-augmented generation. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy. The framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia.
- Score: 8.94249680213101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, a condition common to most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-grounded multimodal reasoning. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia. Qualitative analyses show that ArtSeek can interpret visual motifs, infer historical context, and retrieve relevant knowledge, even for obscure works. Though focused on visual arts, our approach generalizes to other domains requiring external knowledge, supporting scalable multimodal AI research. Both the dataset and the source code will be made publicly available at https://github.com/cilabuniba/artseek.
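The abstract names late interaction retrieval but gives no formula. A minimal sketch of the standard ColBERT-style MaxSim scoring that the term refers to is shown below; the embeddings are random stand-ins, not ArtSeek's actual encoders, and the function name is illustrative only:

```python
import numpy as np

def late_interaction_score(query_embs, doc_embs):
    """ColBERT-style MaxSim: for each query token, take the best cosine
    similarity against any document token, then sum over query tokens."""
    # normalize rows so dot products become cosine similarities
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                        # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # MaxSim per query token, summed

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))       # 4 query-token embeddings, dim 8
distractor = rng.normal(size=(6, 8))  # unrelated fragment
relevant = np.vstack([query, rng.normal(size=(2, 8))])  # contains the query tokens

# A fragment containing the query tokens outranks an unrelated one
assert late_interaction_score(query, relevant) > late_interaction_score(query, distractor)
```

Because scoring decomposes into per-token maxima, document token embeddings can be indexed offline and candidate fragments re-ranked cheaply at query time, which is what makes late interaction attractive for Wikipedia-scale retrieval.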
Related papers
- ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding [16.9945713458689]
ArtRAG is a novel framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. At inference time, a structured retriever selects semantically and topologically relevant subgraphs to guide generation. Experiments on the SemArt and ArtPedia datasets show that ArtRAG outperforms several heavily trained baselines.
arXiv Detail & Related papers (2025-05-09T13:08:27Z) - KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph [24.586916324061168]
We present KALE, a Knowledge-Augmented vision-Language model for artwork Elaborations.
KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph.
Experimental results demonstrate that KALE achieves strong performance over existing state-of-the-art work across several artwork datasets.
arXiv Detail & Related papers (2024-09-17T06:39:18Z) - GalleryGPT: Analyzing Paintings with Large Multimodal Models [64.98398357569765]
Artwork analysis is an important and fundamental skill for art appreciation, one that can enrich personal aesthetic sensibility and foster critical thinking.
Previous work on automatically analyzing artworks mainly focuses on classification, retrieval, and other simple tasks, which fall far short of this goal.
We introduce GalleryGPT, a large multimodal model for composing painting analyses, built by slightly modifying and fine-tuning the LLaVA architecture.
arXiv Detail & Related papers (2024-08-01T11:52:56Z) - MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions [64.89284104414865]
We introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions.
MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations.
MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks.
arXiv Detail & Related papers (2024-03-28T17:59:20Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - One-shot Scene Graph Generation [130.57405850346836]
We propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task.
Our method significantly outperforms existing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-02-22T11:32:59Z) - Fully Context-Aware Image Inpainting with a Learned Semantic Pyramid [102.24539566851809]
Restoring reasonable and realistic content for arbitrary missing regions in images is an important yet challenging task.
Recent image inpainting models have made significant progress in generating vivid visual details, but they can still lead to texture blurring or structural distortions.
We propose the Semantic Pyramid Network (SPN) motivated by the idea that learning multi-scale semantic priors can greatly benefit the recovery of locally missing content in images.
arXiv Detail & Related papers (2021-12-08T04:33:33Z) - The Curious Layperson: Fine-Grained Image Recognition without Expert Labels [90.88501867321573]
We consider a new problem: fine-grained image recognition without expert annotations.
We learn a model to describe the visual appearance of objects using non-expert image descriptions.
We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis.
arXiv Detail & Related papers (2021-11-05T17:58:37Z) - Object Retrieval and Localization in Large Art Collections using Deep Multi-Style Feature Fusion and Iterative Voting [10.807131260367298]
We introduce an algorithm that allows users to search for image regions containing specific motifs or objects.
Our region-based voting with GPU-accelerated approximate nearest-neighbour search allows us to find and localize even small motifs within an extensive dataset in a few seconds.
arXiv Detail & Related papers (2021-07-14T18:40:49Z) - Graph Neural Networks for Knowledge Enhanced Visual Representation of Paintings [12.724750260261066]
ArtSAGENet is a novel architecture that integrates Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs). We show that our proposed ArtSAGENet captures and encodes valuable dependencies between the artists and the artworks. Our findings underline the great potential of integrating visual content and semantics for fine art analysis and curation.
arXiv Detail & Related papers (2021-05-17T23:05:36Z)