Fugu-MT Paper Translation (Summary): MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Paper overview: MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

arxiv url: http://arxiv.org/abs/2508.10133v1
Date: Wed, 13 Aug 2025 18:56:57 GMT
Status: Translation complete
Last updated in system: 2025-08-15 22:24:48.089546
Title: MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning
Title (reference translation): MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning
Authors: Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu
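Abstract summary: This paper proposes a Multimodal Attention-based Normalizing Flow (MANGO) approach. A new Invertible Cross-Attention (ICA) layer is proposed to develop a normalizing flow-based model for multimodal data. Three new cross-attention mechanisms are also introduced: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA).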
Reference score (independently computed attention score): 12.821814562210632
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach to developing explicit, interpretable, and tractable multimodal fusion learning (the source code of this work will be publicly available). In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.
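Abstract (reference translation): Multimodal learning has achieved great success in recent years. However, current multimodal fusion methods use the attention mechanism of Transformers to learn the underlying correlations of multimodal features implicitly. As a result, the multimodal model cannot capture the essential features of each modality, which makes it difficult to understand the complex structures and correlations of multimodal inputs. This paper introduces the Multimodal Attention-based Normalizing Flow (MANGO) approach for developing explicit, interpretable, and tractable multimodal fusion learning (the source code of this work will be made publicly available). In particular, a new Invertible Cross-Attention (ICA) layer is proposed to develop a normalizing flow-based model for multimodal data. To efficiently capture the complex correlations of multimodal data in the proposed invertible cross-attention layer, three new cross-attention mechanisms are proposed: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, a new multimodal attention-based normalizing flow is proposed so that the method scales to high-dimensional multimodal data. Experimental results on three multimodal learning tasks, namely semantic segmentation, image-to-image translation, and movie genre classification, demonstrate the state-of-the-art (SoTA) performance of the proposed approach.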
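
The abstract describes the fusion layer as invertible and tractable but does not give its equations. As a rough illustration of what those two properties mean in a normalizing flow, the sketch below uses a generic conditional affine coupling step in which one modality's features set the scale and shift applied to another's. This is NOT the paper's ICA layer: the function names, the linear conditioning weights `w_scale` and `w_shift`, and the standard-normal base density are all assumptions made for this example.

```python
import numpy as np

def coupling_forward(x_a, x_b, w_scale, w_shift):
    """Map modality A features to latents, conditioned on modality B (hypothetical setup)."""
    # Scale and shift depend only on x_b, so x_a -> z_a can be undone exactly given x_b.
    log_s = np.tanh(x_b @ w_scale)   # bounded log-scale for numerical stability
    t = x_b @ w_shift
    z_a = x_a * np.exp(log_s) + t
    # The Jacobian dz_a/dx_a is diagonal, so log|det J| is the sum of the log-scales.
    log_det = log_s.sum(axis=-1)
    return z_a, log_det

def coupling_inverse(z_a, x_b, w_scale, w_shift):
    """Recover modality A features from latents using the same conditioning on x_b."""
    log_s = np.tanh(x_b @ w_scale)
    t = x_b @ w_shift
    return (z_a - t) * np.exp(-log_s)

rng = np.random.default_rng(0)
d = 4                                   # feature dimension (arbitrary for the demo)
x_a = rng.normal(size=(2, d))           # stand-in for features of one modality
x_b = rng.normal(size=(2, d))           # stand-in for features of another modality
w_scale = 0.1 * rng.normal(size=(d, d)) # assumed linear conditioning weights
w_shift = 0.1 * rng.normal(size=(d, d))

z_a, log_det = coupling_forward(x_a, x_b, w_scale, w_shift)
x_rec = coupling_inverse(z_a, x_b, w_scale, w_shift)

# Exact conditional log-likelihood via the change-of-variables formula,
# assuming a standard-normal base density over z_a.
log_prob = -0.5 * (z_a ** 2).sum(axis=-1) - 0.5 * d * np.log(2 * np.pi) + log_det
```

Because the scale and shift never depend on `x_a`, the inverse is available in closed form and the log-determinant costs a single sum, which is the sense in which such a layer is "tractable". An attention-based conditioning, as the paper proposes, would replace the linear maps used here.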


Related papers