Studying Illustrations in Manuscripts: An Efficient Deep-Learning Approach
- URL: http://arxiv.org/abs/2601.05269v2
- Date: Mon, 12 Jan 2026 11:37:13 GMT
- Title: Studying Illustrations in Manuscripts: An Efficient Deep-Learning Approach
- Authors: Yoav Evron, Michal Bar-Asher Siegal, Michael Fire
- Abstract summary: We present a general and scalable AI-based pipeline for large-scale visual analysis of illuminated manuscripts. The framework integrates modern deep-learning models for page-level illustration detection, illustration extraction, and multimodal description. We demonstrate the applicability of this approach on large heterogeneous collections, including the Vatican Library and richly illuminated manuscripts such as the Bible of Borso d'Este.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent Artificial Intelligence (AI) revolution has opened transformative possibilities for the humanities, particularly in unlocking the visual-artistic content embedded in historical illuminated manuscripts. While digital archives now offer unprecedented access to these materials, the ability to systematically locate, extract, and analyze illustrations at scale remains a major challenge. We present a general and scalable AI-based pipeline for large-scale visual analysis of illuminated manuscripts. The framework integrates modern deep-learning models for page-level illustration detection, illustration extraction, and multimodal description, enabling scholars to search, cluster, and study visual materials and artistic trends across entire corpora. We demonstrate the applicability of this approach on large heterogeneous collections, including the Vatican Library and richly illuminated manuscripts such as the Bible of Borso d'Este. The system reveals meaningful visual patterns and cross-manuscript relationships by embedding illustrations into a shared representation space and analyzing their similarity structure (see figure 4). By harnessing recent advances in computer vision and vision-language models, our framework enables new forms of large-scale visual scholarship in historical studies, art history, and cultural heritage, making it possible to explore iconography, stylistic trends, and cultural connections in ways that were previously impractical.
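The abstract's final analysis step, embedding illustrations into a shared representation space and analyzing their similarity structure, can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the 512-dimensional random vectors stand in for embeddings that would in practice come from a vision or vision-language encoder, and the neighbor count `k` is an arbitrary choice.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of an (n, d) embedding matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # L2-normalize each row
    return unit @ unit.T

def nearest_neighbors(sim: np.ndarray, k: int = 3) -> np.ndarray:
    """For each illustration, indices of its k most similar peers (self excluded)."""
    order = np.argsort(-sim, axis=1)  # descending similarity per row
    return order[:, 1:k + 1]          # column 0 is the item itself (sim = 1)

# Toy stand-in for real illustration embeddings (e.g., from a CLIP-style encoder):
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 512))

sim = cosine_similarity_matrix(emb)
neigh = nearest_neighbors(sim, k=2)   # 2 closest illustrations for each of the 6
```

On top of such a similarity matrix, cross-manuscript relationships could be surfaced by clustering (e.g., agglomerative clustering on `1 - sim` as a distance) or by flagging high-similarity pairs that come from different manuscripts.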
Related papers
- From Show Programmes to Data: Designing a Workflow to Make Performing Arts Ephemera Accessible Through Language Models
We show how vision-language models can accurately parse and transcribe born-digital and digitised programmes. We train a reasoning model (POntAvignon) using reinforcement learning with both formal and semantic rewards. This approach enables automated RDF triple generation and supports alignment with existing knowledge graphs.
arXiv Detail & Related papers (2025-12-08T11:27:10Z) - Disc-Cover Complexity Trends in Music Illustrations from Sinatra to Swift
We examine the visual complexity of album covers spanning 75 years and 11 popular musical genres. Our analysis reveals a broad shift toward minimalism across most genres, with notable exceptions. At the same time, we observe growing variance over time, with many covers continuing to display high levels of abstraction and intricacy.
arXiv Detail & Related papers (2025-10-01T15:01:25Z) - A Critical Assessment of Modern Generative Models' Ability to Replicate Artistic Styles
This paper presents a critical assessment of the style replication capabilities of contemporary generative models. We examine how effectively these models reproduce traditional artistic styles while maintaining structural integrity and compositional balance. The analysis is based on a new large dataset of AI-generated works imitating artistic styles of the past.
arXiv Detail & Related papers (2025-02-21T07:00:06Z) - Diffusion-Based Visual Art Creation: A Survey and New Perspectives
This survey explores the emerging realm of diffusion-based visual art creation, examining its development from both artistic and technical perspectives.
Our findings reveal how artistic requirements are transformed into technical challenges and highlight the design and application of diffusion-based methods within visual art creation.
We aim to shed light on the mechanisms through which AI systems emulate, and possibly enhance, human capacities in artistic perception and creativity.
arXiv Detail & Related papers (2024-08-22T04:49:50Z) - GalleryGPT: Analyzing Paintings with Large Multimodal Models
Artwork analysis is an important and fundamental skill for art appreciation, which can enrich personal aesthetic sensibility and foster critical thinking.
Previous work on automatically analyzing artworks mainly focuses on classification, retrieval, and other simple tasks, which falls far short of the goal of comprehensive AI-driven art analysis.
We introduce a large multimodal model for painting analysis, dubbed GalleryGPT, which is slightly modified and fine-tuned based on the LLaVA architecture.
arXiv Detail & Related papers (2024-08-01T11:52:56Z) - Composition Vision-Language Understanding via Segment and Depth Anything Model
This library synergizes the capabilities of the Depth Anything Model (DAM), Segment Anything Model (SAM), and GPT-4V.
Through the fusion of segmentation and depth analysis at the symbolic instance level, our library provides nuanced inputs for language models.
Our findings showcase progress in vision-language models through neural-symbolic integration.
arXiv Detail & Related papers (2024-06-07T16:28:06Z) - There Is a Digital Art History
We revisit Johanna Drucker's question, "Is there a digital art history?"
We focus our analysis on two main aspects that seem to suggest a coming paradigm shift towards a "digital" art history.
arXiv Detail & Related papers (2023-08-14T21:21:03Z) - Foundational Models Defining a New Era in Vision: A Survey and Outlook
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z) - Automatic Image Content Extraction: Operationalizing Machine Learning in Humanistic Photographic Studies of Large Visual Archives
We introduce the Automatic Image Content Extraction framework for machine-learning-based search and analysis of large image archives.
The proposed framework can be applied in several domains in humanities and social sciences.
arXiv Detail & Related papers (2022-04-05T12:19:24Z) - From Show to Tell: A Survey on Image Captioning
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not yet converged on a definitive solution.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.