Fugu-MT Paper Translation (Abstract): MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

Paper summary: MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

arxiv url: http://arxiv.org/abs/2605.20090v1
Date: Tue, 19 May 2026 16:47:02 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.534059
Title: MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling
Title (reference translation): MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-Centered Joint Modeling
Authors: Zhiping Yu, Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu, Zhengxia Zou, Zhenwei Shi
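Abstract summary: We develop MetaEarth-MM, a generative foundation model for multi-modal remote sensing imagery. The model organizes generation around the underlying scene content. It shows strong generative capability and robust generalization across diverse generation tasks.
Reference score (independently computed attention score): 33.18025936405946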
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.
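Abstract (reference translation): Multi-modal remote sensing images are essential for Earth observation, but complete paired observations are scarce in practice. Existing generative methods often address this problem through isolated pairwise modality translation, but their versatility and scalability become limited as the number of modalities and generation tasks grows. We therefore develop MetaEarth-MM, a generative foundation model for multi-modal remote sensing imagery, which enables paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from the available observations and then generates the target modalities conditioned on this intermediate state. To support training, we also construct EarthMM, a large-scale dataset of 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments show that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks but also supports downstream tasks at both the data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.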
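
The decoupled, scene-centered flow described in the abstract (infer a latent scene representation from whatever observations are available, then generate target modalities conditioned on it) can be illustrated with a toy sketch. This is not the authors' implementation: the modality names, the scaling "networks", the mean fusion, and every function name below are hypothetical placeholders chosen only to show the two-stage structure.

```python
from typing import Callable, Dict, List

Vector = List[float]


def make_scale(factor: float) -> Callable[[Vector], Vector]:
    """Return a toy 'network' that multiplies every element by a fixed factor."""
    return lambda v: [factor * x for x in v]


# Hypothetical modalities; the abstract mentions five but does not name them here.
# Each encoder maps an observation to the shared scene latent,
# and each decoder maps the scene latent back to one modality.
ENCODERS: Dict[str, Callable[[Vector], Vector]] = {
    "mod_a": make_scale(1.0),
    "mod_b": make_scale(0.5),
    "mod_c": make_scale(0.25),
}
DECODERS: Dict[str, Callable[[Vector], Vector]] = {
    "mod_a": make_scale(1.0),
    "mod_b": make_scale(2.0),
    "mod_c": make_scale(4.0),
}


def infer_scene(observations: Dict[str, Vector]) -> Vector:
    """Stage 1: fuse the available modalities into one scene latent.

    Fusion by element-wise mean is an assumption made for this sketch.
    """
    if not observations:
        raise ValueError("at least one observed modality is required")
    encoded = [ENCODERS[name](obs) for name, obs in observations.items()]
    return [sum(values) / len(encoded) for values in zip(*encoded)]


def generate(scene: Vector, targets: List[str]) -> Dict[str, Vector]:
    """Stage 2: produce each target modality conditioned only on the scene latent."""
    return {name: DECODERS[name](scene) for name in targets}


def translate(observations: Dict[str, Vector], targets: List[str]) -> Dict[str, Vector]:
    """Any-to-any translation: any subset of inputs, any subset of outputs."""
    return generate(infer_scene(observations), targets)


if __name__ == "__main__":
    # One input modality, one target modality.
    print(translate({"mod_a": [1.0, 2.0]}, ["mod_b"]))
    # Two input modalities fused into one scene, then decoded to a third.
    print(translate({"mod_a": [1.0, 2.0], "mod_b": [2.0, 4.0]}, ["mod_c"]))
```

In this toy setup, routing everything through a shared scene latent needs one encoder and one decoder per modality (10 components for five modalities), whereas a separate model for every directed modality pair would need 5 × 4 = 20. That count only holds for the sketch; the abstract does not give this breakdown.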
