Fugu-MT Paper Translation (Abstract): M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification

Paper overview: M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification

arxiv url: http://arxiv.org/abs/2503.06446v1
Date: Sun, 09 Mar 2025 05:06:47 GMT
Status: Translation complete
Last updated in system: 2025-03-11 20:09:44.39914
Title: M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification
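Title (reference translation): M$^3$amba: A CLIP-Driven Mamba Model for Multi-modal Remote Sensing Classification
Authors: Mingxiang Cao, Weiying Xie, Xin Zhang, Jiaqing Zhang, Kai Jiang, Jie Lei, Yunsong Li
Abstract summary: M$^3$amba is a novel end-to-end CLIP-driven Mamba model for multi-modal fusion. CLIP-driven modality-specific adapters are proposed to achieve a comprehensive semantic understanding of different modalities. Experiments show that M$^3$amba achieves an average performance improvement of at least 5.98% over state-of-the-art methods.
Reference score (attention score computed independently by this site): 23.322598623627222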
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-modal fusion holds great promise for integrating information from different modalities. However, due to a lack of consideration for modal consistency, existing multi-modal fusion methods in the field of remote sensing still face challenges of incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the visual language pre-training model CLIP can effectively extract strong semantic information from visual features, we propose M$^3$amba, a novel end-to-end CLIP-driven Mamba model for multi-modal fusion to address these challenges. Specifically, we introduce CLIP-driven modality-specific adapters in the fusion architecture to avoid the bias of understanding specific domains caused by direct inference, making the original CLIP encoder modality-specific perception. This unified framework enables minimal training to achieve a comprehensive semantic understanding of different modalities, thereby guiding cross-modal feature fusion. To further enhance the consistent association between modality mappings, a multi-modal Mamba fusion architecture with linear complexity and a cross-attention module Cross-SS2D are designed, which fully considers effective and efficient information interaction to achieve complete fusion. Extensive experiments have shown that M$^3$amba has an average performance improvement of at least 5.98\% compared with the state-of-the-art methods in multi-modal hyperspectral image classification tasks in the remote sensing field, while also demonstrating excellent training efficiency, achieving a double improvement in accuracy and efficiency. The code is released at https://github.com/kaka-Cao/M3amba.
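Abstract (reference translation): Multi-modal fusion holds great promise for integrating information from different modalities. However, because modal consistency is not taken into account, existing multi-modal fusion methods in remote sensing still face the challenges of incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the vision-language pre-training model CLIP can effectively extract strong semantic information from visual features, we propose M$^3$amba, a novel end-to-end CLIP-driven Mamba model for multi-modal fusion, to address these challenges. Specifically, to avoid the domain-specific understanding bias caused by direct inference, we introduce CLIP-driven modality-specific adapters into the fusion architecture, giving the original CLIP encoder modality-specific perception. This unified framework achieves a comprehensive semantic understanding of different modalities with minimal training, which in turn guides cross-modal feature fusion. To further strengthen the consistent association between modality mappings, we design a multi-modal Mamba fusion architecture with linear complexity and a cross-attention module, Cross-SS2D, which fully consider effective and efficient information interaction to achieve complete fusion. Extensive experiments show that M$^3$amba achieves an average performance improvement of at least 5.98% over state-of-the-art methods on multi-modal hyperspectral image classification tasks in remote sensing, while also demonstrating excellent training efficiency, improving both accuracy and efficiency. The code is released at https://github.com/kaka-Cao/M3amba.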
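The "linear complexity" mentioned in the abstract can be illustrated with a generic state-space recurrence, the family of models that Mamba belongs to. The sketch below is not the authors' fusion architecture or their Cross-SS2D module, neither of which is specified on this page. It is a minimal, fixed-coefficient diagonal scan in plain Python; the function name `diagonal_ssm_scan` and the coefficients `a`, `b`, `c` are hypothetical and chosen only for illustration. It shows why such a model processes a sequence in a single pass, so that cost grows linearly with sequence length.

```python
from typing import List, Sequence


def diagonal_ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t and y_t = sum(c * h_t) over a 1-D sequence.

    a, b and c hold one coefficient per state dimension. They are fixed,
    hypothetical values here; Mamba instead makes the corresponding
    parameters depend on the input (the "selective" part), which this
    sketch does not attempt to reproduce.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must have the same length")
    h = [0.0] * len(a)  # hidden state, one value per state dimension
    y: List[float] = []
    for x_t in x:  # single pass over the input: O(len(x) * len(a)) work
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y
```

For example, `diagonal_ssm_scan([1.0, 0.0, 0.0], [0.5], [1.0], [1.0])` feeds a unit impulse through a one-dimensional state that decays by half at each step. By contrast, standard self-attention compares every position with every other position, which is quadratic in sequence length.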

Related papers

MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt [60.10555128510744]
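Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by exploiting complementary image information from different modalities. Recently, large-scale pre-trained models such as CLIP have shown remarkable performance on traditional single-modal object ReID tasks. We introduce MambaPro, a novel framework for multi-modal object ReID.
Paper reference translation (metadata) (2024-12-14T06:33:53Z)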
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment [37.213291617683325]
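Cross-modal alignment is essential for multimodal representation fusion. We propose AlignMamba, an effective and efficient method for multimodal fusion. Experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
Paper reference translation (metadata) (2024-12-01T14:47:41Z)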
LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
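Existing methods rely heavily on modal-specific pre-training and joint-modal tuning, which imposes a significant computational burden when expanding to new modalities. PathWeave is a flexible and scalable framework with Modal-Path sWitching and ExpAnsion abilities. PathWeave is compatible with state-of-the-art MLLMs and reduces the parameter training burden by 98.73%.
Paper reference translation (metadata) (2024-10-26T13:19:57Z)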
StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
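We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers. We introduce a multi-directional adapter module (MultiAdapter) during encoding to enable cross-modal information transfer. Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
Paper reference translation (metadata) (2024-08-02T15:41:16Z)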
ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map [1.6570772838074355]
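We design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. It features the Modal Fusion Map (MFM), a novel dimensionality reduction method. Case studies show that ModalChorus facilitates intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.
Paper reference translation (metadata) (2024-07-17T04:49:56Z)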
Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities [8.517830626176641]
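Any2Seg is a novel framework that achieves robust segmentation from any combination of modalities under any visual conditions. Experiments on two benchmarks with four modalities show that Any2Seg achieves the state of the art under the multi-modal setting.
Paper reference translation (metadata) (2024-07-16T03:34:38Z)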
U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
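We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation. We employ feature fusion at multiple scales to ensure effective extraction and integration of global and local features. Experiments show that our method achieves superior performance across multiple datasets.
Paper reference translation (metadata) (2024-05-24T08:58:48Z)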
FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba [19.761723108363796]
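FusionMamba aims to overcome the challenges faced by CNNs and Vision Transformers (ViTs) in computer vision tasks. The framework improves the visual state-space model Mamba by integrating dynamic convolution and channel attention mechanisms. Experiments show that FusionMamba achieves state-of-the-art performance on a variety of multimodal image fusion tasks and downstream experiments.
Paper reference translation (metadata) (2024-04-15T06:37:21Z)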
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
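Unsupervised pre-training has achieved great success in skeleton-based action understanding. We propose a Unified Multi-modal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode multi-modal features in a single stream.
Paper reference translation (metadata) (2023-11-06T13:56:57Z)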
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
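We construct a transformer-based framework for multi-modal manipulation detection and grounding. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose implicit manipulation queries (IMQ) that adaptively aggregate global contextual cues within each modality.
Paper reference translation (metadata) (2023-09-22T06:55:41Z)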

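The related papers list is generated automatically from the titles and abstracts of papers on this site.

This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.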