Fugu-MT paper translation (summary): MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

Paper summary: MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

arxiv url: http://arxiv.org/abs/2507.04635v1
Date: Mon, 07 Jul 2025 03:37:42 GMT
Status: translation complete
Last updated in system: 2025-07-08 15:46:35.267012
Title: MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding
Title (reference translation): MODA: Modular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding
Authors: Zhicheng Zhang, Wuyou Xia, Chenxi Zhao, Zhou Yan, Xiaoqiang Liu, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
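Abstract summary: Multimodal large language models (MLLMs) have recently shown strong capacity in integrating data across multiple modalities. MOdular Duplex Attention (MODA) simultaneously conducts inner-modal refinement and inter-modal interaction. Experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
Reference score (independently computed attention score): 24.731387422897644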
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while less exploring multimodal tokens mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting the inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling the interaction between visual and language modality. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available in https://zzcheng.top/MODA.
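Abstract (reference translation): Multimodal large language models (MLLMs), empowered by a generalizable attention architecture, have recently shown strong capacity in integrating data across multiple modalities. Advanced methods focus mainly on language-centric tuning and explore less how multimodal tokens are mixed through attention, which poses challenges in high-level tasks that require fine-grained cognition and emotion understanding. This work identifies an attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and attention activation that decays layer by layer. To address it, the authors propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), which simultaneously conducts inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling interaction between the visual and language modalities. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available at https://zzcheng.top/MODA.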
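The abstract mentions masked attention with customizable masking patterns per modality. The following is a minimal, illustrative sketch of that general idea only: plain scaled dot-product attention with a boolean mask built from modality labels. It is not the MODA implementation. The function names (`masked_attention`, `modality_mask`), the toy tokens, and the two mask patterns (`inner` for same-modality attention, `inter` for cross-modality attention) are assumptions made for illustration and do not come from the paper.

```python
import math


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention with a boolean mask (hypothetical helper).

    mask[i][j] is True when query token i may attend to key token j.
    """
    dim = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Blocked positions get a score of -inf, so their softmax weight is 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        top = max(scores)
        if top == -math.inf:
            raise ValueError(f"query {i} is not allowed to attend to any key")
        # Numerically stable softmax over the allowed positions.
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def modality_mask(modalities, allowed):
    """Build a mask from (query_modality, key_modality) pairs (hypothetical helper)."""
    return [[(mq, mk) in allowed for mk in modalities] for mq in modalities]


# Toy sequence: two visual tokens followed by one text token (assumed data).
modalities = ["visual", "visual", "text"]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Assumed pattern 1: tokens attend only within their own modality.
inner = modality_mask(modalities, {("visual", "visual"), ("text", "text")})
# Assumed pattern 2: tokens attend only across modalities.
inter = modality_mask(modalities, {("visual", "text"), ("text", "visual")})

inner_out = masked_attention(tokens, tokens, tokens, inner)
inter_out = masked_attention(tokens, tokens, tokens, inter)
```

In this sketch the mask is the only thing that changes between the two calls, which is one simple way to make the attention pattern configurable per modality; how MODA combines or adapts such masks is described in the paper itself, not here.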

Related papers

True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
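Multimodal Large Language Models (MLLMs) enable Multimodal In-Context Learning (MICL) for adapting to new tasks. Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. The paper introduces Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
Paper reference translation (metadata) (2025-07-21T17:08:18Z)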
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
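MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance on the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
Paper reference translation (metadata) (2025-06-29T06:41:00Z)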
Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection [0.0]
This paper proposes the multimodal Co-AttenDWG architecture. The approach is validated on MIMIC and SemEval Memotion 1.0.
Paper reference translation (metadata) (2025-05-25T07:26:00Z)
MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network [6.304608172789466]
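The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts. The architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations.
Paper reference translation (metadata) (2025-03-16T19:32:32Z)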
Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition [37.12407597998884]
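A novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues. GraphSmile comprises two key components, the GSF and SDP modules. Empirical results on multiple benchmarks show that GraphSmile can handle complex emotional and sentimental patterns.
Paper reference translation (metadata) (2024-07-31T11:47:36Z)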
Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
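Multimodal emotion recognition in conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields. Previous work ignored the inter-modal alignment process and intra-modal noise information before multimodal fusion. The authors developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to address this problem.
Paper reference translation (metadata) (2024-07-23T02:23:51Z)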
AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
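The paper proposes a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features. Experiments with the AIMDiT framework on the public benchmark dataset MELD show improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively.
Paper reference translation (metadata) (2024-04-12T11:31:18Z)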
MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
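The paper introduces an approach for enhancing multimodal models called Multimodal Mixtures of Experts (MMoE). MMoE can be applied to, and can improve, various types of models.
Paper reference translation (metadata) (2023-11-16T05:31:21Z)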
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
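The authors construct a transformer-based framework for multi-modal manipulation detection and grounding. The framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. The paper proposes implicit manipulation queries (IMQ) that adaptively aggregate global contextual cues within each modality.
Paper reference translation (metadata) (2023-09-22T06:55:41Z)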
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input [27.102030262319197]
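The paper proposes Switch-BERT for joint vision and language representation learning to address the problem of modality mismatch. Switch-BERT extends the BERT architecture by introducing learnable layer-wise and cross-layer interactions. Results show that, whereas alternative architectures such as ViLBERT and UNITER excel at particular tasks, Switch-BERT consistently achieves better or comparable performance.
Paper reference translation (metadata) (2023-06-25T09:28:40Z)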
Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
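Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. The paper proposes the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG). HCT-DMG: 1) outperforms previous multimodal models with approximately 0.8M parameters; 2) recognizes hard samples in which incongruity hampers affect recognition; 3) mitigates incongruity at the latent level in crossmodal attention.
Paper reference translation (metadata) (2023-05-23T01:24:15Z)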
Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
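The key to effective video representations is cross-modal representation learning and fine-grained feature discrimination. This paper strengthens intra-modality and cross-modality relations in representation modeling. It enlarges the discriminative power of feature embeddings with a hard-pairs guided contrastive learning scheme.
Paper reference translation (metadata) (2022-06-21T07:29:37Z)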

The related paper list is generated automatically from the titles and abstracts of papers on this site.

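This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and assumes no responsibility for any results arising from the use of this site (including all information and translations).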