Data-Centric Machine Learning for Earth Observation: Necessary and Sufficient Features
- URL: http://arxiv.org/abs/2408.11384v1
- Date: Wed, 21 Aug 2024 07:26:43 GMT
- Title: Data-Centric Machine Learning for Earth Observation: Necessary and Sufficient Features
- Authors: Hiba Najjar, Marlon Nuske, Andreas Dengel
- Abstract summary: We leverage model explanation methods to identify the features crucial for the model to reach optimal performance.
Some datasets can reach their optimal accuracy with less than 20% of the temporal instances, while in other datasets, the time series of a single band from a single modality is sufficient.
- Score: 5.143097874851516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The availability of temporal geospatial data in multiple modalities has been extensively leveraged to enhance the performance of machine learning models. While efforts on the design of adequate model architectures are approaching a level of saturation, focusing on a data-centric perspective can complement these efforts to achieve further enhancements in data usage efficiency and model generalization capacities. This work contributes to this direction. We leverage model explanation methods to identify the features crucial for the model to reach optimal performance and the smallest set of features sufficient to achieve this performance. We evaluate our approach on three temporal multimodal geospatial datasets and compare multiple model explanation techniques. Our results reveal that some datasets can reach their optimal accuracy with less than 20% of the temporal instances, while in other datasets, the time series of a single band from a single modality is sufficient.
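The recipe implied by the abstract — rank input features with a model-explanation method, then search for the smallest subset that preserves accuracy — can be sketched as follows. This is a minimal illustration using permutation importance and a random forest on synthetic data; the feature layout, tolerance, and models are assumptions for demonstration, not the authors' actual pipeline, explanation methods, or datasets.

```python
# Minimal sketch (not the paper's exact pipeline): rank features with an
# explanation method (here, permutation importance), then grow a subset in
# importance order until a retrained model matches the full model's accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for a flattened multimodal time series:
# e.g. 3 spectral bands x 12 temporal instances = 36 features per sample.
n_bands, n_steps = 3, 12
X = rng.normal(size=(600, n_bands * n_steps))
y = (X[:, 5] + 0.5 * X[:, 17] > 0).astype(int)  # only a few features matter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
full_acc = model.score(X_te, y_te)

# 1) "Necessary" features: those whose permutation noticeably hurts accuracy.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]

# 2) "Sufficient" subset: smallest prefix of the ranking whose retrained model
# stays within a small tolerance of the full model's accuracy.
tol = 0.01
for k in range(1, X.shape[1] + 1):
    subset = ranking[:k]
    sub_model = RandomForestClassifier(n_estimators=200, random_state=0)
    sub_model.fit(X_tr[:, subset], y_tr)
    if sub_model.score(X_te[:, subset], y_te) >= full_acc - tol:
        print(f"{k} of {X.shape[1]} features are sufficient "
              f"(full accuracy: {full_acc:.3f})")
        break
```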
Related papers
- Plots Unlock Time-Series Understanding in Multimodal Models [5.792074027074628]
This paper proposes a method that leverages the existing vision encoders of multimodal foundation models to "see" time-series data via plots (a minimal rendering sketch appears after this list).
Our empirical evaluations show that this approach outperforms providing the raw time-series data as text.
To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks.
arXiv Detail & Related papers (2024-10-03T16:23:13Z)
- Automated Label Unification for Multi-Dataset Semantic Segmentation with GNNs [48.406728896785296]
We propose a novel approach to automatically construct a unified label space across multiple datasets using graph neural networks.
Unlike existing methods, our approach facilitates seamless training without the need for additional manual reannotation or taxonomy reconciliation.
arXiv Detail & Related papers (2024-07-15T08:42:10Z)
- Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective.
This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
arXiv Detail & Related papers (2024-01-29T03:42:37Z)
- Better, Not Just More: Data-Centric Machine Learning for Earth Observation [16.729827218159038]
We argue that a shift from a model-centric view to a complementary data-centric perspective is necessary for further improvements in accuracy, generalization ability, and real impact on end-user applications.
This work presents a definition as well as a precise categorization and overview of automated data-centric learning approaches for geospatial data.
arXiv Detail & Related papers (2023-12-08T19:24:05Z)
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- Benchmarking Data Efficiency and Computational Efficiency of Temporal Action Localization Models [42.06124795143787]
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end.
This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power.
arXiv Detail & Related papers (2023-08-24T20:59:55Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training [44.790636524264]
Point Prompt Training is a novel framework for multi-dataset synergistic learning in the context of 3D representation learning.
It can overcome the negative transfer associated with synergistic learning and produce generalizable representations.
It achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training.
arXiv Detail & Related papers (2023-08-18T17:59:57Z)
- Revealing the Underlying Patterns: Investigating Dataset Similarity, Performance, and Generalization [0.0]
Supervised deep learning models require a significant amount of labeled data to achieve acceptable performance on a specific task.
We establish image-image, dataset-dataset, and image-dataset distances to gain insights into the model's behavior.
arXiv Detail & Related papers (2023-08-07T13:35:53Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only for the source dataset and unavailable for the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
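As a side note on the first related entry above ("Plots Unlock Time-Series Understanding in Multimodal Models"), the plot-rendering idea can be sketched as below. The helper name `series_to_image`, the figure size, and the use of matplotlib/PIL are illustrative assumptions, not that paper's implementation; the resulting image would then be handed to a multimodal model's vision encoder.

```python
# Minimal sketch: render a 1-D time series as a line plot image so that a
# multimodal model's vision encoder can "see" it (assumed helper, not the
# paper's actual code).
import io

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image


def series_to_image(values, size=(448, 448)):
    """Render a 1-D time series as a line plot and return it as an RGB image."""
    fig, ax = plt.subplots(figsize=(4, 4), dpi=112)
    ax.plot(np.arange(len(values)), values)
    ax.set_xlabel("time step")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB").resize(size)


# Example: the image can be passed to any vision encoder that accepts RGB input.
img = series_to_image(np.sin(np.linspace(0, 6, 120)))
```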
This list is automatically generated from the titles and abstracts of the papers on this site.