Fugu-MT Paper Translation (Abstract): LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

Paper overview: LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

arxiv url: http://arxiv.org/abs/2605.04323v2
Date: Fri, 08 May 2026 14:33:51 GMT
Status: Translation complete
System update date: 2026-05-11 16:31:22.830835
Title: LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems
Authors: Kuangdai Leng, Simon Jeffery, Panos Panagos, Tarje Nissen-Meyer
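Abstract summary: We introduce LUCAS-MEGA, a large-scale dataset constructed through systematic data fusion of European soil-environment observations. The dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes.
Reference score (independently computed attention score): 0.0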
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small-scale predictive settings rather than high-dimensional representation learning. We introduce LUCAS-MEGA, a large-scale multimodal dataset constructed through systematic data fusion of European soil-environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit inconsistencies, codebook mismatches, and erroneous values), incorporates natural language annotations, and harmonizes multimodal attributes and metadata into a unified, machine learning-ready feature space. The resulting dataset captures key characteristics of real-world soil observations, including multimodality, uneven feature coverage, and heterogeneous uncertainty. To demonstrate the usability of LUCAS-MEGA for data-driven modeling, we pretrain a multimodal tabular transformer (SoilFormer) using a self-supervised objective based on feature masking, achieving stable training, strong predictive performance, and representations that support uncertainty-aware prediction. We further show that the learned representations recover relationships consistent with established soil processes. LUCAS-MEGA is released with open access and is accompanied by composable, agent-friendly APIs that support structured querying and data-driven workflows.
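The feature-masking objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the SoilFormer implementation, which the abstract does not describe in detail. It only shows the general idea on a toy table with uneven coverage: hide some observed values, predict them from what remains, and score only the hidden entries. The function names (`mask_features`, `masked_mse`, `column_mean_predictor`), the use of `None` for unmeasured values, the column-mean stand-in model, and the toy numbers are all assumptions made for illustration.

```python
import random


def mask_features(rows, mask_ratio=0.15, seed=None):
    """Hide a random subset of the observed values in a feature table.

    rows: list of equal-length lists; None marks a value that was never measured.
    Returns (masked_rows, targets), where targets maps (row, col) to the hidden value.
    """
    rng = random.Random(seed)
    masked_rows = [list(row) for row in rows]
    targets = {}
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # Only observed values can be hidden; missing ones carry no target.
            if value is not None and rng.random() < mask_ratio:
                targets[(i, j)] = value
                masked_rows[i][j] = None
    return masked_rows, targets


def masked_mse(predict, masked_rows, targets):
    """Mean squared error computed over the hidden entries only."""
    if not targets:
        return 0.0
    errors = [
        (predict(masked_rows, i, j) - value) ** 2
        for (i, j), value in targets.items()
    ]
    return sum(errors) / len(errors)


def column_mean_predictor(masked_rows, i, j):
    """Stand-in for a model: mean of the still-visible values in column j."""
    visible = [row[j] for row in masked_rows if row[j] is not None]
    return sum(visible) / len(visible) if visible else 0.0


# Toy table with uneven coverage (hypothetical values, not LUCAS-MEGA data).
table = [
    [6.5, 1.2, None],
    [5.8, None, 30.0],
    [7.1, 2.4, 12.0],
    [None, 0.9, 18.0],
]
masked, targets = mask_features(table, mask_ratio=0.5, seed=0)
loss = masked_mse(column_mean_predictor, masked, targets)
```

In a real pretraining setup the stand-in predictor would be replaced by a trainable model, and categorical, text, and image features would need their own encoders and loss terms. The sketch covers numeric features only.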


Related papers