Fugu-MT Paper Translation (Abstract): MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion

Paper overview: MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion

arxiv url: http://arxiv.org/abs/2509.12901v1
Date: Tue, 16 Sep 2025 09:58:06 GMT
Status: Translation complete
Last updated in system: 2025-09-17 17:50:53.018997
Title: MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion
Authors: Guihui Li, Bowei Dong, Kaizhi Dong, Jiayi Li, Haiyong Zheng
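Abstract summary: We introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations. It delivers semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.
Reference score (attention score computed by this site): 10.160499805076755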
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Infrared and visible image fusion has garnered considerable attention owing to the strong complementarity of these two modalities in complex, harsh environments. While deep learning-based fusion methods have made remarkable advances in feature extraction, alignment, fusion, and reconstruction, they still depend largely on low-level visual cues, such as texture and contrast, and struggle to capture the high-level semantic information embedded in images. Recent attempts to incorporate text as a source of semantic guidance have relied on unstructured descriptions that neither explicitly model entities, attributes, and relationships nor provide spatial localization, thereby limiting fine-grained fusion performance. To overcome these challenges, we introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations, and then synchronously refines high-level semantics and low-level details through successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion. Extensive experiments on multiple public benchmarks show that MSGFusion significantly outperforms state-of-the-art approaches, particularly in detail preservation and structural clarity, and delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.
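The abstract says a scene graph explicitly represents entities, attributes, and spatial relations. As a rough illustration of what such a structure can look like, here is a minimal Python sketch. It is not the authors' implementation: the class names (`Entity`, `SceneGraph`), the bounding-box field, and the example scene are all hypothetical and chosen only to make the idea concrete.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node in the scene graph (hypothetical schema, not from the paper)."""
    name: str
    attributes: list[str] = field(default_factory=list)
    # Assumed bounding box (x_min, y_min, x_max, y_max) for spatial localization.
    box: tuple[int, int, int, int] = (0, 0, 0, 0)


@dataclass
class SceneGraph:
    """Entities plus (subject, predicate, object) relation triples."""
    entities: dict[str, Entity] = field(default_factory=dict)
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def add_relation(self, subject: str, predicate: str, obj: str) -> None:
        # Both endpoints must already exist as entities in the graph.
        for name in (subject, obj):
            if name not in self.entities:
                raise KeyError(f"unknown entity: {name}")
        self.relations.append((subject, predicate, obj))

    def neighbors(self, subject: str) -> list[tuple[str, str]]:
        """Return (predicate, object) pairs for relations starting at subject."""
        return [(p, o) for s, p, o in self.relations if s == subject]


# Made-up example scene: a pedestrian standing beside a parked car.
graph = SceneGraph()
graph.add_entity(Entity("person", ["walking"], (40, 60, 80, 180)))
graph.add_entity(Entity("car", ["parked"], (100, 90, 260, 170)))
graph.add_relation("person", "next to", "car")

print(graph.neighbors("person"))
```

In this sketch, unstructured text such as "a person walking next to a parked car" becomes two attributed nodes and one relation triple, which is the kind of explicit structure the abstract contrasts with free-form descriptions.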

Related papers