Better, Not Just More: Data-Centric Machine Learning for Earth Observation
- URL: http://arxiv.org/abs/2312.05327v3
- Date: Tue, 05 Nov 2024 14:12:40 GMT
- Title: Better, Not Just More: Data-Centric Machine Learning for Earth Observation
- Authors: Ribana Roscher, Marc Rußwurm, Caroline Gevaert, Michael Kampffmeyer, Jefersson A. dos Santos, Maria Vakalopoulou, Ronny Hänsch, Stine Hansen, Keiller Nogueira, Jonathan Prexl, Devis Tuia
- Abstract summary: We argue that a shift from a model-centric view to a complementary data-centric perspective is necessary for further improvements in accuracy, generalization ability, and real impact on end-user applications.
This work presents a definition as well as a precise categorization and overview of automated data-centric learning approaches for geospatial data.
- Score: 16.729827218159038
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent developments and research in modern machine learning have led to substantial improvements in the geospatial field. Although numerous deep learning architectures and models have been proposed, the majority of them have been solely developed on benchmark datasets that lack strong real-world relevance. Furthermore, the performance of many methods has already saturated on these datasets. We argue that a shift from a model-centric view to a complementary data-centric perspective is necessary for further improvements in accuracy, generalization ability, and real impact on end-user applications. Furthermore, considering the entire machine learning cycle, from problem definition to model deployment with feedback, is crucial for enhancing machine learning models that can be reliable in unforeseen situations. This work presents a definition as well as a precise categorization and overview of automated data-centric learning approaches for geospatial data. It highlights the complementary role of data-centric learning with respect to model-centric learning in the larger machine learning deployment cycle. We review papers across the entire geospatial field and categorize them into different groups. A set of representative experiments shows concrete implementation examples. These examples provide concrete steps to act on geospatial data with data-centric machine learning approaches.
 
      
Related papers
- Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition [0.0]
 We investigate whether foundation models pretrained on remote sensing and general vision datasets can be effectively combined to improve performance.
The results show that feature-level ensembling of smaller pretrained models can match or exceed the performance of much larger models.
The study highlights the potential of applying knowledge distillation to transfer the strengths of ensembles into more compact models.
 arXiv  Detail & Related papers  (2025-06-25T07:02:42Z)
- DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
 We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data.
In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods.
For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
 arXiv  Detail & Related papers  (2025-01-03T19:00:00Z)
- Data-Centric Machine Learning for Earth Observation: Necessary and Sufficient Features [5.143097874851516]
 We leverage model explanation methods to identify the features crucial for the model to reach optimal performance.
Some datasets can reach their optimal accuracy with less than 20% of the temporal instances, while in other datasets, the time series of a single band from a single modality is sufficient.
 arXiv  Detail & Related papers  (2024-08-21T07:26:43Z)
- A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data [9.57464542357693]
 This paper demonstrates that model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering.
We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset.
After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection reduces.
 arXiv  Detail & Related papers  (2024-07-02T09:54:39Z)
- Dataset Regeneration for Sequential Recommendation [69.93516846106701]
 We propose a data-centric paradigm for developing an ideal training dataset using a model-agnostic dataset regeneration framework called DR4SR.
To demonstrate the effectiveness of the data-centric paradigm, we integrate our framework with various model-centric methods and observe significant performance improvements across four widely adopted datasets.
 arXiv  Detail & Related papers  (2024-05-28T03:45:34Z)
- TSPP: A Unified Benchmarking Tool for Time-series Forecasting [3.5415344166235534]
 We propose a unified benchmarking framework that exposes the crucial modelling and machine learning decisions involved in developing time series forecasting models.
This framework fosters seamless integration of models and datasets, aiding both practitioners and researchers in their development efforts.
We benchmark recently proposed models within this framework, demonstrating that carefully implemented deep learning models with minimal effort can rival gradient-boosting decision trees.
 arXiv  Detail & Related papers  (2023-12-28T16:23:58Z)
- Federated Learning with Projected Trajectory Regularization [65.6266768678291]
 Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is to handle non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling the data issue.
 arXiv  Detail & Related papers  (2023-12-22T02:12:08Z)
- Robust Computer Vision in an Ever-Changing World: A Survey of Techniques for Tackling Distribution Shifts [20.17397328893533]
 AI applications are becoming increasingly visible to the general public.
There is a notable gap between the theoretical assumptions researchers make about computer vision models and the reality those models face when deployed in the real world.
One of the critical reasons for this gap is a challenging problem known as distribution shift.
 arXiv  Detail & Related papers  (2023-12-03T23:40:12Z)
- Deep networks for system identification: a Survey [56.34005280792013]
 System identification learns mathematical descriptions of dynamic systems from input-output data.
The main aim of the identified model is to predict new data from previous observations.
We discuss architectures commonly adopted in the literature, like feedforward, convolutional, and recurrent networks.
 arXiv  Detail & Related papers  (2023-01-30T12:38:31Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
 Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, we are given access to a set of expert models and their predictions alongside some limited information about the dataset used to train them.
 arXiv  Detail & Related papers  (2022-10-11T10:20:31Z)
- A Topological-Framework to Improve Analysis of Machine Learning Model Performance [5.3893373617126565]
 We propose a framework for evaluating machine learning models in which a dataset is treated as a "space" on which a model operates.
We describe a topological data structure, presheaves, which offer a convenient way to store and analyze model performance between different subpopulations.
 arXiv  Detail & Related papers  (2021-07-09T23:11:13Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
 We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
 arXiv  Detail & Related papers  (2021-04-11T12:14:04Z)
- Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
 We propose a new data selection method that exploits a diverse set of criteria that quantify the interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
 arXiv  Detail & Related papers  (2021-01-16T23:45:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     