Fugu-MT Paper Translation (Abstract): Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models

Paper overview: Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models

arxiv url: http://arxiv.org/abs/2509.01959v1
Date: Tue, 02 Sep 2025 05:02:23 GMT
Status: Translation complete
System update date: 2025-09-04 15:17:03.910434
Title: Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models
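Title (reference translation): Structure-aware contrastive learning for diagram understanding in multimodal models
Authors: Hiroshi Sasaki
Abstract summary: This paper proposes a new training paradigm for improving the understanding of diagrammatic images in vision-language models. The method lets models build a more structured and semantically coherent understanding of diagram content.
Reference score (independently computed attention score): 0.609170287691728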
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal models, such as the Contrastive Language-Image Pre-training (CLIP) model, have demonstrated remarkable success in aligning visual and linguistic representations. However, these models exhibit limitations when applied to specialised visual domains, such as diagrams, which encode structured, symbolic information distinct from that of natural imagery. In this paper, we introduce a novel training paradigm explicitly designed to enhance the comprehension of diagrammatic images within vision-language models. Our approach uses "hard" samples for our proposed contrastive learning that incorporates two specialised loss functions that leverage the inherent structural properties of diagrams. By integrating these objectives into model training, our method enables models to develop a more structured and semantically coherent understanding of diagrammatic content. We empirically validate our approach on a benchmark dataset of flowcharts, as a representative class of diagrammatic imagery, demonstrating substantial improvements over standard CLIP and conventional hard negative CLIP learning paradigms for both image-text matching and visual question answering tasks. Our findings underscore the significance of tailored training strategies for specialised tasks and contribute to advancing diagrammatic understanding within the broader landscape of vision-language integration.
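Abstract (reference translation): Multimodal models such as the Contrastive Language-Image Pre-Training (CLIP) model have achieved remarkable success in aligning visual and linguistic representations. However, these models show limitations when applied to specialised visual domains such as diagrams, which encode structured, symbolic information distinct from that of natural images. This paper proposes a new training paradigm designed to improve the understanding of diagrammatic images in vision-language models. The proposed method uses "hard" samples for contrastive learning that incorporates two specialised loss functions exploiting the inherent structural properties of diagrams. Incorporating these objectives into model training lets the model develop a more structured and semantically coherent understanding of diagrammatic content. We empirically validate the approach on a benchmark dataset of flowcharts, a representative class of diagrammatic imagery, and show substantial improvements over standard CLIP and conventional hard-negative CLIP training on both image-text matching and visual question answering. The findings underscore the importance of tailored training strategies for specialised tasks and contribute to advancing diagram understanding within the broader landscape of vision-language integration.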
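
The abstract names "conventional hard negative CLIP learning" as a baseline but does not spell out any loss. As a rough illustration only, the sketch below shows a generic symmetric CLIP-style contrastive (InfoNCE) loss in which hard negative captions are appended as extra candidates in the image-to-text direction. It is not the paper's method: the two structure-aware losses are not specified in the abstract and are not reproduced here. The function names, the temperature value of 0.07, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def clip_loss_with_hard_negatives(image_embs, text_embs, hard_neg_text_embs,
                                  temperature=0.07):
    """Generic CLIP-style loss (hypothetical helper, not from the paper).

    image_embs[i] and text_embs[i] form a matching pair. Every vector in
    hard_neg_text_embs is an extra wrong caption that each image must
    score below its own caption.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    negs = [normalize(v) for v in hard_neg_text_embs]
    n = len(imgs)

    # Image-to-text: candidates are all in-batch captions plus hard negatives.
    candidates = txts + negs
    i2t = 0.0
    for i in range(n):
        logits = [dot(imgs[i], c) / temperature for c in candidates]
        i2t += cross_entropy(logits, i)

    # Text-to-image: candidates are the in-batch images only.
    t2i = 0.0
    for j in range(n):
        logits = [dot(txts[j], img) / temperature for img in imgs]
        t2i += cross_entropy(logits, j)

    return (i2t / n + t2i / n) / 2


# Toy usage with made-up 2-D embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[0.9, 0.1], [0.1, 0.9]]  # captions close to the true ones
print(clip_loss_with_hard_negatives(images, texts, []))
print(clip_loss_with_hard_negatives(images, texts, hard_negatives))
```

Because the hard negatives only add terms to the softmax denominator, the second printed loss is larger than the first for the same embeddings. That extra penalty is what pushes a model to separate near-miss captions from the correct one.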


Related papers