Fugu-MT Paper Translation (Abstract): Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

Paper overview: Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

arxiv url: http://arxiv.org/abs/2603.26052v1
Date: Fri, 27 Mar 2026 03:38:38 GMT
Status: Translation complete
System update date: 2026-03-30 21:49:48.347559
Title: Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
Title (reference translation): Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
Authors: Zizhao Chen, Ping Wei, Ziyang Ren, Huan Li, Xiangru Yin
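Abstract summary: MaLSF (Mask-aware Local Semantic Fusion) is a novel framework that shifts the paradigm to active, bidirectional verification. It uses mask-label pairs as semantic anchors to bridge pixels and words. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks.
Reference score (independently computed attention score): 13.571218577944032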
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to 'feature dilution,' global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.
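Abstract (reference translation): As multimodal misinformation grows more sophisticated, detecting and grounding it becomes essential. Current multimodal verification methods, however, depend on passive holistic fusion and struggle with sophisticated misinformation. Because of "feature dilution," global alignment tends to average out subtle local semantic inconsistencies, which in effect hides the very conflicts these methods are designed to find. MaLSF (Mask-aware Local Semantic Fusion) is a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF uses mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism consists of two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that uses parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, a set of diverse mask-label pair extraction parsers is introduced to extract fine-grained mask-label pairs. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.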
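
The two parallel query streams described in the abstract (Text-as-Query and Image-as-Query) can be illustrated with a small sketch. This is not the authors' BCV module, whose details the abstract does not give. It is a toy example that assumes plain scaled dot-product cross-attention, and it assumes a conflict score of one minus cosine similarity between each token and the evidence it retrieves from the other modality. The function names `cross_attend` and `bidirectional_conflict` and the two-dimensional feature vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def bidirectional_conflict(text_feats, region_feats):
    """Hypothetical per-token conflict scores from two query streams."""
    # Text-as-Query: each word retrieves visual evidence from the regions.
    text_view = cross_attend(text_feats, region_feats, region_feats)
    # Image-as-Query: each region retrieves textual evidence from the words.
    image_view = cross_attend(region_feats, text_feats, text_feats)
    # Assumed scoring rule: low similarity to retrieved evidence = conflict.
    text_conflict = [1.0 - cosine(t, v) for t, v in zip(text_feats, text_view)]
    image_conflict = [1.0 - cosine(r, v) for r, v in zip(region_feats, image_view)]
    return text_conflict, image_conflict

# Toy features: word 0 matches the regions, word 1 has no visual support.
text = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [1.0, 0.1]]
text_conflict, image_conflict = bidirectional_conflict(text, regions)
print(text_conflict, image_conflict)
```

In this toy input the second word receives a higher conflict score than the first, because nothing it retrieves from the regions resembles it. Keeping the scores per word and per region, rather than pooling them into one global similarity, is the point of the sketch: it shows the kind of local signal that global averaging would dilute.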

Related papers