Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models
- URL: http://arxiv.org/abs/2509.01959v1
- Date: Tue, 02 Sep 2025 05:02:23 GMT
- Title: Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models
- Authors: Hiroshi Sasaki
- Abstract summary: We introduce a novel training paradigm to enhance the comprehension of diagrammatic images within vision-language models. Our method enables models to develop a more structured and semantically coherent understanding of diagrammatic content.
- Score: 0.609170287691728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal models, such as the Contrastive Language-Image Pre-training (CLIP) model, have demonstrated remarkable success in aligning visual and linguistic representations. However, these models exhibit limitations when applied to specialised visual domains, such as diagrams, which encode structured, symbolic information distinct from that of natural imagery. In this paper, we introduce a novel training paradigm explicitly designed to enhance the comprehension of diagrammatic images within vision-language models. Our approach uses ``hard'' samples in a proposed contrastive learning scheme that incorporates two specialised loss functions leveraging the inherent structural properties of diagrams. By integrating these objectives into model training, our method enables models to develop a more structured and semantically coherent understanding of diagrammatic content. We empirically validate our approach on a benchmark dataset of flowcharts, a representative class of diagrammatic imagery, demonstrating substantial improvements over standard CLIP and conventional hard-negative CLIP learning paradigms on both image-text matching and visual question answering tasks. Our findings underscore the significance of tailored training strategies for specialised tasks and contribute to advancing diagrammatic understanding within the broader landscape of vision-language integration.
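The two specialised structural loss functions are not defined in this listing; as a rough, hypothetical sketch of the general recipe the abstract describes (CLIP-style contrastive learning extended with one ``hard'' sample per image), the following PyTorch snippet appends a structurally perturbed caption as an extra negative. The function name, tensor shapes, and the perturbation strategy mentioned in the comment are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def clip_hard_negative_loss(img_emb, txt_emb, hard_txt_emb, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss in which each image is additionally
    contrasted against one structurally perturbed "hard" caption (e.g. a
    flowchart description with shuffled node order or reversed edges).
    All inputs are (batch, dim) embedding tensors."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    hard = F.normalize(hard_txt_emb, dim=-1)

    logits_i2t = img @ txt.t() / temperature                         # (B, B) in-batch pairs
    hard_col = (img * hard).sum(dim=-1, keepdim=True) / temperature  # (B, 1) hard negatives
    logits_i2t = torch.cat([logits_i2t, hard_col], dim=1)            # widen with hard column

    labels = torch.arange(img.size(0), device=img.device)            # diagonal = positives
    loss_i2t = F.cross_entropy(logits_i2t, labels)                   # image -> text
    loss_t2i = F.cross_entropy(txt @ img.t() / temperature, labels)  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

The paper's two structure-aware objectives would replace or augment the single hard-negative term shown here; this sketch only fixes the scaffolding they plug into.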
Related papers
- Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models [0.609170287691728]
We propose a new training paradigm designed to enhance diagram comprehension in vision-language models. Our approach introduces pseudo contrastive samples produced by a diagram generator that assembles synthetic diagrams from randomly picked text elements (a toy sketch of this sampling step follows). By incorporating these pseudo contrastive samples into the training objective, the model learns to capture more precise and semantically consistent diagram structures.
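The generator itself is not described in this summary; the sketch below, with an assumed label pool and an assumed arrow-joined caption format, only illustrates the random element-sampling idea.

```python
import random

# Hypothetical pool of flowchart node labels; the paper's actual element
# source and diagram renderer are not specified in this summary.
NODE_LABELS = ["Start", "Read input", "Validate data", "Transform", "Write output", "End"]

def sample_pseudo_diagram(num_nodes=4, pool=NODE_LABELS):
    """Randomly pick text elements and join them into a synthetic
    flowchart description that a renderer could turn into an image."""
    nodes = random.sample(pool, k=min(num_nodes, len(pool)))
    return " -> ".join(nodes)

# Two draws give structurally different pseudo samples that can serve as
# contrastive counterparts during training.
print(sample_pseudo_diagram())  # e.g. "End -> Validate data -> Start -> Transform"
print(sample_pseudo_diagram())
```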
arXiv Detail & Related papers (2026-02-27T01:39:53Z)
- Object-centric Binding in Contrastive Language-Image Pretraining [9.376583779399834]
We propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard negatives. Our resulting model paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
arXiv Detail & Related papers (2025-02-19T21:30:51Z)
- Language Model as Visual Explainer [72.88137795439407]
We present a systematic approach for interpreting vision models using a tree-structured linguistic explanation. Our method provides human-understandable explanations in the form of attribute-laden trees. To assess the effectiveness of our approach, we introduce new benchmarks and conduct rigorous evaluations.
arXiv Detail & Related papers (2024-12-08T20:46:23Z)
- In-Context Learning Improves Compositional Understanding of Vision-Language Models [2.762909189433944]
Compositional image understanding remains a difficult task due to the object bias present in training data.
We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses.
Our proposed approach outperforms baseline models across multiple compositional understanding datasets.
arXiv Detail & Related papers (2024-07-22T09:03:29Z)
- Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present Structure-CLIP, an end-to-end framework for enhancing multi-modal structured representations. Scene graphs are used to guide the construction of semantic negative examples, which places greater emphasis on learning structured representations (a toy sketch of one such construction follows). A Knowledge-Enhanced Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
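This summary does not spell out how scene graphs produce the negatives; one common construction, swapping the subject and object of a relation triple, is sketched below as an assumption rather than Structure-CLIP's exact rule.

```python
# Hypothetical scene-graph-guided hard negative: swapping the subject and
# object of a relation triple keeps every word of the caption but breaks
# its structure, forcing the model to attend to relations, not just objects.
def triple_to_caption(subj: str, rel: str, obj: str) -> str:
    return f"a {subj} {rel} a {obj}"

def semantic_negative(subj: str, rel: str, obj: str) -> str:
    """Build a structure-violating caption from the same triple."""
    return triple_to_caption(obj, rel, subj)

positive = triple_to_caption("dog", "chasing", "cat")   # "a dog chasing a cat"
negative = semantic_negative("dog", "chasing", "cat")   # "a cat chasing a dog"
```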
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative, task-specific visual features by jointly using a vision-specific contrastive loss, a cross-modal contrastive loss, and implicit knowledge distillation (a hypothetical combination of these terms is sketched below).
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
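The summary names three loss terms without giving their combination; the sketch below assumes a simple weighted sum with a softened KL-divergence distillation term, where the weights and temperature are placeholders rather than the paper's values.

```python
import torch
import torch.nn.functional as F

def sgva_style_objective(logits_visual, logits_cross, teacher_logits,
                         labels, w_kd=0.5, tau=2.0):
    """Assumed combination of the three terms named in the summary: a
    vision-specific contrastive loss, a cross-modal contrastive loss,
    and implicit knowledge distillation from a frozen teacher."""
    l_visual = F.cross_entropy(logits_visual, labels)        # vision-specific term
    l_cross = F.cross_entropy(logits_cross, labels)          # cross-modal term
    # Distillation: match softened student and teacher distributions.
    l_kd = F.kl_div(F.log_softmax(logits_visual / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau * tau
    return l_visual + l_cross + w_kd * l_kd
```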
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.