Fugu-MT Paper Translation (Abstract): Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

Paper abstract: Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

arxiv url: http://arxiv.org/abs/2604.21104v1
Date: Wed, 22 Apr 2026 21:43:03 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.193941
Title: Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance
Title (reference translation): Pretrain Where? On How the Diversity of Pretraining Data Affects Geospatial Foundation Model Performance
Authors: Amandeep Kaur, Mirali Purohit, Gedeon Muhawenayo, Esther Rolf, Hannah Kerner
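Abstract summary: Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations.
Reference score (independently computed attention score): 15.19997016963026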
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.
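Abstract (reference translation): Each new geospatial foundation model introduces a new model architecture together with a new pretraining dataset, and these datasets are often sampled under different notions of data diversity. Differences in performance are mostly attributed to the model architecture or the input modalities, whereas the role of the pretraining dataset is rarely studied. To address this research gap, we carried out a systematic study of how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. The pretraining dataset from Europe outperformed the global and continent-specific pretraining datasets in both global and local downstream evaluations. To identify the factors that influence a pretraining dataset's downstream performance, we analysed 10 pretraining datasets in terms of their diversity across continents, biomes, land cover, and spectral values. Only spectral diversity was strongly correlated with performance, while the others were weakly correlated. This finding establishes a new dimension of diversity to take into account when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.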
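
The abstract describes relating per-dataset diversity measures to downstream performance but does not say which diversity metric or correlation statistic was used. The sketch below is only one plausible way to set up such an analysis: it assumes Shannon entropy as the diversity measure over categorical labels (for example continent, biome, or land-cover class) and Spearman rank correlation as the statistic. The dataset names, labels, and scores are made-up placeholders for illustration, not values from the paper, and the helper names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in nats) of a list of categorical labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder inputs, NOT taken from the paper: one list of categorical
# labels per pretraining dataset, and one downstream score per dataset.
toy_labels = {
    "dataset_a": ["region_1", "region_1", "region_1", "region_1"],
    "dataset_b": ["region_1", "region_1", "region_2", "region_3"],
    "dataset_c": ["region_1", "region_2", "region_3", "region_4"],
}
toy_scores = {"dataset_a": 0.3, "dataset_b": 0.5, "dataset_c": 0.4}

names = sorted(toy_labels)
diversity = [shannon_entropy(toy_labels[name]) for name in names]
scores = [toy_scores[name] for name in names]
rho = spearman(diversity, scores)
print(f"Spearman correlation between diversity and score: {rho:.2f}")
```

A rank-based statistic is chosen here because it does not assume a linear relationship between diversity and performance. Measuring spectral diversity would need a different measure over continuous pixel values, which this sketch does not attempt.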

Related papers