Fugu-MT Paper Translation (Summary): RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation

Paper overview: RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation

arxiv url: http://arxiv.org/abs/2504.03166v1
Date: Fri, 04 Apr 2025 04:47:54 GMT
Status: Translation complete
Last updated in system: 2025-04-14 20:52:19.160863
Title: RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation
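Title (reference translation): RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation
Authors: Hanbo Bi, Yingchao Feng, Boyuan Tong, Mengyu Wang, Haichen Yu, Yongqiang Mao, Hao Chang, Wenhui Diao, Peijin Wang, Yue Yu, Hanyang Peng, Yehong Zhang, Kun Fu, Xian Sun
Abstract summary: RingMoE is a unified RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. It has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
Reference score (attention score computed by this site): 24.48561340129571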
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; (3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
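Abstract (reference translation): The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models mainly handle single or limited modalities and overlook the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data provide complementary insights that significantly reduce the inherent ambiguity and uncertainty of single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) a hierarchical Mixture-of-Experts (MoE) architecture composed of modal-specialized, collaborative, and shared experts, which models intra-modal knowledge effectively while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) physics-informed self-supervised learning, which explicitly embeds sensor-specific radiometric characteristics into the pre-training objectives; (3) dynamic expert pruning, which enables adaptive model compression from 14.7B to 1B parameters while maintaining performance, easing efficient deployment in Earth observation applications. Evaluated on 23 benchmarks spanning six key RS tasks (classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, showing remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.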
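The abstract names three expert groups (modal-specialized, collaborative, shared) and a pruning step but gives no implementation details. The following is a minimal sketch of how such a layer could be wired, not the authors' method. Everything in it is an assumption made for illustration: the class `HierarchicalMoELayer`, the helper names, the top-k softmax gating, the residual sum of the three group outputs, the random weights standing in for learned ones, and the usage-count pruning rule, which is a simple stand-in for the paper's dynamic expert pruning.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def top_k_mix(experts, gate, x, k):
    """Send x to the k experts with the highest gate probability and
    mix their outputs using gate weights renormalised over those k."""
    probs = softmax(matvec(gate, x))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = matvec(experts[i], x)
        out = [o + probs[i] / norm * yi for o, yi in zip(out, y)]
    return out, chosen


class HierarchicalMoELayer:
    """Illustrative layer with three expert groups (assumed design)."""

    def __init__(self, dim, modalities, n_modal=4, n_collab=4, k=2, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Modal-specialized experts: one private pool and gate per modality.
        self.modal_experts = {m: [random_matrix(dim, dim, rng) for _ in range(n_modal)]
                              for m in modalities}
        self.modal_gates = {m: random_matrix(n_modal, dim, rng) for m in modalities}
        # Collaborative experts: one pool routed across all modalities.
        self.collab_experts = [random_matrix(dim, dim, rng) for _ in range(n_collab)]
        self.collab_gate = random_matrix(n_collab, dim, rng)
        # Shared expert: always active, no routing.
        self.shared_expert = random_matrix(dim, dim, rng)

    def forward(self, x, modality):
        modal_out, modal_ids = top_k_mix(
            self.modal_experts[modality], self.modal_gates[modality], x, self.k)
        collab_out, collab_ids = top_k_mix(
            self.collab_experts, self.collab_gate, x, self.k)
        shared_out = matvec(self.shared_expert, x)
        # Residual connection plus the three expert contributions.
        y = [xi + a + b + c for xi, a, b, c in zip(x, modal_out, collab_out, shared_out)]
        return y, {"modal": modal_ids, "collab": collab_ids}


def prune_collaborative(layer, calibration, keep):
    """Keep the `keep` collaborative experts selected most often on calibration data."""
    counts = [0] * len(layer.collab_experts)
    for x, modality in calibration:
        _, routed = layer.forward(x, modality)
        for i in routed["collab"]:
            counts[i] += 1
    ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    kept = sorted(ranked[:keep])
    layer.collab_experts = [layer.collab_experts[i] for i in kept]
    layer.collab_gate = [layer.collab_gate[i] for i in kept]
    return kept


layer = HierarchicalMoELayer(dim=4, modalities=["optical", "sar", "multispectral"])
token = [0.1, 0.2, 0.3, 0.4]
output, routed = layer.forward(token, "sar")
kept = prune_collaborative(layer, [(token, "sar"), (token, "optical")], keep=2)
pruned_output, _ = layer.forward(token, "sar")
```

In this toy run the pruned layer reproduces the original output for the calibration token, because the two discarded collaborative experts were never selected for it. A real system would have to measure accuracy on held-out data after pruning.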

Related papers

Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning [18.268054258939213]
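We introduce linear probing evaluation for multi-modal detectors and rethink the multi-modal object detection task. We construct a novel framework, M$^2$D-LIF, consisting of a Mono-Modality Distillation (M$^2$D) method and a Local Illumination-aware Fusion (LIF) module. Our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms previous SOTA detectors.
Paper reference translation (metadata) (2025-03-14T18:15:53Z)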
Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations [7.439049772394586]
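Diffusion Augmented Retrieval (DAR) is a paradigm-shifting framework that bypasses MLLM fine-tuning entirely. DAR synergizes Large Language Model (LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis to create contextually rich intermediate representations.
Paper reference translation (metadata) (2025-01-26T03:29:18Z)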
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
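This paper introduces a new task for remote sensing, Multi-Modal Datasets and Multi-Task Object Detection (M2Det). It is designed to accurately detect horizontal or oriented objects from any sensor modality. The task is challenging because of 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
Paper reference translation (metadata) (2024-12-30T02:47:51Z)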
Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
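We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent. When half of the test set is modal-incomplete, we reduce the drop from the original $\sim 30\%$ to $\sim 10\%$.
Paper reference translation (metadata) (2024-01-21T11:55:42Z)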
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery [35.550999964460466]
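We present SkySense, a generic billion-scale model pre-trained on a multi-modal remote sensing dataset with 21.5 million temporal sequences. To our knowledge, SkySense is the largest multi-modal remote sensing foundation model to date, and its modules can be flexibly combined or used individually to fit a variety of tasks.
Paper reference translation (metadata) (2023-12-15T09:57:21Z)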
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
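Unsupervised pre-training has achieved great success in skeleton-based action understanding. We propose a unified multi-modal unsupervised representation learning framework called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode multi-modal features in a single stream.
Paper reference translation (metadata) (2023-11-06T13:56:57Z)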
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
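We construct a transformer-based framework for multi-modal manipulation detection and grounding. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
Paper reference translation (metadata) (2023-09-22T06:55:41Z)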
Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
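We propose to use invariant features for a missing modality imagination network (IF-MMIN). We show that the proposed model outperforms all baselines and invariantly improves overall emotion recognition performance under uncertain missing-modality conditions.
Paper reference translation (metadata) (2022-10-27T12:16:25Z)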
Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
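Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations. The model takes two bimodal pairs as input because of the known information imbalance among modalities.
Paper reference translation (metadata) (2021-07-28T23:33:42Z)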

The related papers list is generated automatically from the titles and abstracts of papers on this site.
