Moving Pictures of Thought: Extracting Visual Knowledge in Charles S. Peirce's Manuscripts with Vision-Language Models
- URL: http://arxiv.org/abs/2511.13378v1
- Date: Mon, 17 Nov 2025 13:52:23 GMT
- Title: Moving Pictures of Thought: Extracting Visual Knowledge in Charles S. Peirce's Manuscripts with Vision-Language Models
- Authors: Carlo Teo Pedretti, Davide Picca, Dario Rodighiero
- Abstract summary: Diagrams are crucial yet underexplored tools in many disciplines. Their iconic form poses obstacles to visual studies, intermedial analysis, and text-based digital workflows. Visual Language Models (VLMs) can help us identify and interpret such hybrid pages in context.
- Score: 0.5352699766206808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diagrams are crucial yet underexplored tools in many disciplines, demonstrating the close connection between visual representation and scholarly reasoning. However, their iconic form poses obstacles to visual studies, intermedial analysis, and text-based digital workflows. In particular, Charles S. Peirce consistently advocated the use of diagrams as essential for reasoning and explanation. His manuscripts, often combining textual content with complex visual artifacts, provide a challenging case for studying documents involving heterogeneous materials. In this preliminary study, we investigate whether Visual Language Models (VLMs) can effectively help us identify and interpret such hybrid pages in context. First, we propose a workflow that (i) segments manuscript page layouts, (ii) reconnects each segment to IIIF-compliant annotations, and (iii) submits fragments containing diagrams to a VLM. In addition, by adopting Peirce's semiotic framework, we designed prompts to extract key knowledge about diagrams and produce concise captions. Finally, we integrated these captions into knowledge graphs, enabling structured representations of diagrammatic content within composite sources.
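The abstract describes the workflow only at a high level, so a minimal sketch of the three steps may help make it concrete. Everything below is an illustrative assumption rather than the authors' implementation: `segment_page`, `caption_with_vlm`, the region label set, the prompt wording, and the `example.org` namespace are hypothetical stand-ins; only the W3C Web Annotation shape used by IIIF (a `target` with an `xywh` media fragment) and the `rdflib` calls follow real, documented formats.

```python
# Minimal sketch of the abstract's pipeline: (i) segment a manuscript page,
# (ii) attach each segment to a IIIF-compatible annotation, (iii) caption
# diagram fragments with a VLM, then emit RDF triples for a knowledge graph.
# All helper names and URIs are hypothetical.
from dataclasses import dataclass

from PIL import Image                                   # pip install pillow
from rdflib import Graph, Literal, Namespace, URIRef    # pip install rdflib
from rdflib.namespace import RDF


@dataclass
class Segment:
    """A detected layout region on a manuscript page (hypothetical schema)."""
    x: int
    y: int
    w: int
    h: int
    kind: str  # e.g. "text" or "diagram" -- assumed label set


def segment_page(page: Image.Image) -> list[Segment]:
    """(i) Layout segmentation. Stand-in for a real layout-analysis model;
    returns a stub bounding box for illustration."""
    return [Segment(100, 220, 640, 480, "diagram")]


def iiif_annotation(canvas_uri: str, seg: Segment, body_text: str) -> dict:
    """(ii) Reconnect a segment to its source page as a W3C Web Annotation
    targeting a IIIF canvas region via an xywh media fragment."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "describing",
        "body": {"type": "TextualBody", "value": body_text},
        "target": f"{canvas_uri}#xywh={seg.x},{seg.y},{seg.w},{seg.h}",
    }


# A prompt organized around Peirce's icon/index/symbol triad -- an assumed
# wording, in the spirit of the semiotic prompts the paper mentions.
PEIRCE_PROMPT = (
    "Describe this diagram in Peircean terms: what it depicts (icon), "
    "what in the surrounding text it points to (index), and what notational "
    "conventions it relies on (symbol). Answer as one concise caption."
)


def caption_with_vlm(crop: Image.Image, prompt: str) -> str:
    """(iii) Submit the diagram fragment to a VLM. Placeholder: plug in any
    multimodal API or local model client here."""
    raise NotImplementedError("VLM client goes here")


def page_to_graph(page: Image.Image, canvas_uri: str) -> Graph:
    """End-to-end: segment the page, caption diagram crops, emit triples."""
    EX = Namespace("https://example.org/peirce/")  # hypothetical namespace
    g = Graph()
    for i, seg in enumerate(segment_page(page)):
        if seg.kind != "diagram":
            continue
        crop = page.crop((seg.x, seg.y, seg.x + seg.w, seg.y + seg.h))
        caption = caption_with_vlm(crop, PEIRCE_PROMPT)
        diagram = EX[f"diagram/{i}"]
        g.add((diagram, RDF.type, EX.Diagram))
        g.add((diagram, EX.caption, Literal(caption)))
        g.add((diagram, EX.source,
               URIRef(f"{canvas_uri}#xywh={seg.x},{seg.y},{seg.w},{seg.h}")))
    return g
```

Serializing the resulting graph (e.g. `page_to_graph(img, canvas).serialize(format="turtle")`) yields the kind of structured representation of diagrammatic content the abstract describes, with each caption anchored back to its exact page region.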
Related papers
- Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models [0.609170287691728]
We introduce a novel training paradigm to enhance the comprehension of diagrammatic images within vision-language models. Our method enables models to develop a more structured and semantically coherent understanding of diagrammatic content.
arXiv Detail & Related papers (2025-09-02T05:02:23Z)
- SmartCLIP: Modular Vision-language Alignment with Identification Guarantees [59.16312652369709]
Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning. CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representation. We introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner.
arXiv Detail & Related papers (2025-07-29T22:26:20Z)
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover a more exact correspondence between text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary supervision, to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Visual Semantic Parsing: From Images to Abstract Meaning Representation [20.60579156219413]
We propose to leverage a widely-used meaning representation in the field of natural language processing, the Abstract Meaning Representation (AMR).
Our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input.
Our findings point to important future research directions for improved scene understanding.
arXiv Detail & Related papers (2022-10-26T17:06:42Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph [96.95815946327079]
It is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities.
We propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities.
arXiv Detail & Related papers (2021-07-26T05:50:41Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method that makes full use of an external language model (ELM) to integrate abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)