Fugu-MT Paper Translation (Summary): StruXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

Paper summary: StruXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

arxiv url: http://arxiv.org/abs/2602.20089v2
Date: Wed, 25 Feb 2026 20:36:22 GMT
Status: Translation complete
System update date: 2026-02-27 14:31:23.843448
Title: StruXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
Title (reference translation): StruXLIP: Enhancing Vision-Language Models with Multimodal Structural Cues
Authors: Zanxi Ruan, Qiuyu Kong, Songqun Gao, Yiming Wang, Marco Cristani
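Abstract summary: We introduce StruXLIP, a fine-tuning alignment paradigm that extracts edge maps as proxies for the visual structure of an image. Fine-tuning augments the standard alignment loss with three structure-centric losses. The proposed method serves as a general boosting recipe that can be integrated into future approaches.
Reference score (independently computed attention score): 12.94672471629668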
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StruXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StruXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StruXLIP.
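The loss structure described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation (see the linked repository for that). It assumes that every term is a symmetric CLIP-style InfoNCE loss over precomputed embeddings, that local edge regions and text chunks are already paired one-to-one, and that the terms are combined with a weighted sum. The function names, the `weights` parameter, and the temperature of 0.07 are all hypothetical choices made for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """CLIP-style contrastive loss; row i of `a` is the positive for row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    a_to_b = -np.mean(np.diag(log_softmax(logits)))
    b_to_a = -np.mean(np.diag(log_softmax(logits.T)))
    return 0.5 * (a_to_b + b_to_a)


def combined_loss(img, txt, edge, struct_txt, regions, chunks,
                  weights=(1.0, 1.0, 1.0)):
    """Standard alignment loss plus three structure-centric terms.

    Assumption: a plain weighted sum; the abstract does not state how the
    terms are actually weighted or how regions and chunks are matched.
    """
    w1, w2, w3 = weights
    base = symmetric_info_nce(img, txt)             # standard image-text alignment
    l_struct = symmetric_info_nce(edge, struct_txt)  # (i) edge map vs. structural text
    l_local = symmetric_info_nce(regions, chunks)    # (ii) edge regions vs. text chunks
    l_anchor = symmetric_info_nce(edge, img)         # (iii) edge map vs. color image
    return base + w1 * l_struct + w2 * l_local + w3 * l_anchor


# Random embeddings stand in for encoder outputs (hypothetical batch of 8).
rng = np.random.default_rng(0)
batch, dim = 8, 32
img, txt, edge, struct_txt, regions, chunks = (
    rng.normal(size=(batch, dim)) for _ in range(6)
)
print(combined_loss(img, txt, edge, struct_txt, regions, chunks))
```

In a real pipeline the edge embeddings would come from passing an edge map (for example, a Canny output) through the image encoder, and the structural text embeddings from encoding the filtered captions; both steps are left out here because the abstract gives no further detail on them.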
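Abstract (reference translation): Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research that remains central today. We extend this principle to vision-language alignment and show that isolating and aligning structural cues across modalities greatly benefits fine-tuning on long, detailed captions, with particular emphasis on improving cross-modal retrieval. We introduce StruXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treats them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, standard CLIP maximizes the mutual information between visual and textual embeddings, whereas StruXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder and guides the model toward more robust and semantically stable minima, improving vision-language alignment. In addition to outperforming current competitors on cross-modal retrieval in both general and specialized domains, the method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at https://github.com/intelligolabs/StruXLIP.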

Related papers