Fugu-MT paper translation (abstract): FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution

Paper overview: FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution

arxiv url: http://arxiv.org/abs/2509.09427v1
Date: Thu, 11 Sep 2025 13:10:22 GMT
Status: Translation complete
System update date: 2025-09-12 16:52:24.391232
Title: FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution
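Title (reference translation): FS-Diff: Semantic Guidance and Clarity-Aware Multimodal Image Fusion and Super-Resolution
Authors: Yuchan Jie, Yushen Xu, Xiaosong Li, Fuqiang Zhou, Jianming Lv, Huafeng Li
Abstract summary: In real-world applications such as military reconnaissance and long-range detection, the target and background structures in multimodal images are easily corrupted. We propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method.
Reference score (independently computed attention level): 19.183004285219184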
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public and our AVMS datasets demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.
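Abstract (reference translation): As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to produce an informative fused image. In recent years, several attempts have been made to realize image fusion and super-resolution jointly. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results with current fusion techniques. We therefore propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It uses semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, the desired fused result is initialized as pure Gaussian noise, and the bidirectional feature Mamba is introduced to extract global features of the multimodal images. Furthermore, with the source images and semantics as conditions, a random iterative denoising process is implemented via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Joint image fusion and super-resolution experiments on six public datasets and our AVMS dataset show that FS-Diff outperforms state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.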
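
The abstract describes starting from pure Gaussian noise and denoising iteratively, conditioned on the source images. The sketch below illustrates that general idea with a standard DDPM-style ancestral sampler in plain Python. It is not the authors' implementation. The linear noise schedule, the function names (`linear_betas`, `make_toy_predictor`, `sample_fused`), and the flat-list image representation are all assumptions made for illustration. The noise predictor is a hypothetical stand-in for the trained modified U-Net: it simply pretends the clean fused image is the pixel-wise mean of the sources. Semantic guidance, the clarity sensing mechanism, the bidirectional feature Mamba, and super-resolution itself are not modelled, and the sources are assumed to be already at the output resolution.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (a common DDPM default, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def make_toy_predictor(alpha_bars):
    """Hypothetical stand-in for a trained conditional noise predictor."""

    def predict_noise(x_t, t, sources):
        # Toy assumption: the clean fused image is the mean of the sources.
        n = len(sources)
        target = [sum(px) / n for px in zip(*sources)]
        a_bar = alpha_bars[t]
        # Invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps for eps.
        return [
            (x - math.sqrt(a_bar) * x0) / math.sqrt(1.0 - a_bar)
            for x, x0 in zip(x_t, target)
        ]

    return predict_noise


def sample_fused(sources, num_steps=50, seed=0):
    """Run reverse diffusion conditioned on a list of flattened images."""
    rng = random.Random(seed)
    betas = linear_betas(num_steps)
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    predict_noise = make_toy_predictor(alpha_bars)

    # Initialise the fused result as pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in sources[0]]
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t, sources)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            # Re-inject noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


# Two tiny 4-pixel "images" standing in for, e.g., infrared and visible inputs.
infrared = [0.2, 0.8, 0.4, 0.6]
visible = [0.6, 0.4, 0.0, 1.0]
fused = sample_fused([infrared, visible])
```

Because the toy predictor knows its own target exactly, the sampler ends at the pixel-wise mean of the two inputs. With a learned network in its place, the same loop would instead produce whatever fused image the network was trained to denoise toward.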

Related papers