SSL4EO-S12 v1.1: A Multimodal, Multiseasonal Dataset for Pretraining, Updated
- URL: http://arxiv.org/abs/2503.00168v2
- Date: Thu, 06 Mar 2025 09:23:35 GMT
- Title: SSL4EO-S12 v1.1: A Multimodal, Multiseasonal Dataset for Pretraining, Updated
- Authors: Benedikt Blumenstiel, Nassim Ait Ali Braham, Conrad M Albrecht, Stefano Maurogiovanni, Paolo Fraccaro,
- Abstract summary: SSL4EO-S12 v1.1 is a dataset designed for pretraining large-scale foundation models. This dataset covers the world's 10,000 largest cities and their surroundings within a 50 km radius across four seasons.
- Score: 1.627786257389396
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This technical report presents SSL4EO-S12 v1.1, a multimodal, multitemporal Earth Observation dataset designed for pretraining large-scale foundation models. Building on the success of SSL4EO-S12 v1.0, the new version addresses the previous challenges of data misalignment and a limited data structure for low-barrier, analysis-ready EO processing. SSL4EO-S12 v1.1 covers the world's 10,000 largest cities and their surroundings within a 50 km radius across four seasons, resulting in a diverse collection of nearly one million patches. SSL4EO-S12 v1.1 packages the data in Zarr file format for cloud-efficient loading and representation of meta-information such as cloud masks and geolocation. Released under the CC-BY-4.0 license, SSL4EO-S12 v1.1 facilitates open research and provides a robust foundation for future advancements in self-supervised learning and geospatial analysis. The dataset is available online through https://datapub.fz-juelich.de/ssl4eo-s12, and we provide additional resources at https://github.com/DLR-MF-DAS/SSL4EO-S12-v1.1.
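The "within a 50 km radius" constraint in the abstract can be pictured with a small geometric check. The following is only a minimal sketch: the abstract does not describe how patch locations were actually sampled, and the function names, the mean Earth radius constant, and the coordinates are illustrative assumptions rather than anything taken from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(city, patch_center, radius_km=50.0):
    """Return True if patch_center lies within radius_km of the city center."""
    return haversine_km(*city, *patch_center) <= radius_km


# Hypothetical, approximate coordinates used purely for illustration.
city = (48.137, 11.575)       # a city center
near_patch = (48.40, 11.74)   # a few tens of km away
far_patch = (49.45, 11.08)    # well over 100 km away

print(within_radius(city, near_patch))
print(within_radius(city, far_patch))
```

A check like this only expresses the distance criterion; it says nothing about how locations are distributed inside the radius or how the four seasonal acquisitions are selected.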
Related papers
- Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data.
We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z)
- Erasing the Bias: Fine-Tuning Foundation Models for Semi-Supervised Learning [4.137391543972184]
Semi-supervised learning (SSL) has witnessed remarkable progress, resulting in numerous method variations.
In this paper, we present a novel SSL approach named FineSSL that significantly addresses this limitation by adapting pre-trained foundation models.
We demonstrate that FineSSL sets a new state of the art for SSL on multiple benchmark datasets, reduces the training cost by over six times, and can seamlessly integrate various fine-tuning and modern SSL algorithms.
arXiv Detail & Related papers (2024-05-20T03:33:12Z)
- On Pretraining Data Diversity for Self-Supervised Learning [57.91495006862553]
We explore the impact of training with more diverse datasets on the performance of self-supervised learning (SSL) under a fixed computational budget.
Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal.
arXiv Detail & Related papers (2024-03-20T17:59:58Z)
- Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning [69.77973092264338]
We show that more powerful techniques can lead to more efficient pre-training, opening SSL to more research groups.
We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages.
We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data.
arXiv Detail & Related papers (2023-09-26T23:55:57Z)
- LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset [75.9621305227523]
We introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art large language models (LLMs).
This dataset is collected from 210K IP addresses in the wild on our Vicuna demo and Arena website.
We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.
arXiv Detail & Related papers (2023-09-21T12:13:55Z)
- SSL4EO-L: Datasets and Foundation Models for Landsat Imagery [8.34029977985994]
The Landsat program is the longest-running Earth observation program in history, with 50+ years of data acquisition by 8 satellites.
Despite the increasing popularity of deep learning and remote sensing, the majority of researchers still use decision trees and random forests for Landsat image analysis.
This paper introduces SSL4EO-L, the first ever dataset designed for Self-Supervised Learning for Earth Observation for the Landsat family of satellites.
arXiv Detail & Related papers (2023-06-15T18:11:20Z)
- A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends [82.64268080902742]
Self-supervised learning (SSL) aims to learn discriminative features from unlabeled data without relying on human-annotated labels.
SSL has garnered significant attention recently, leading to the development of numerous related algorithms.
This paper presents a review of diverse SSL methods, encompassing algorithmic aspects, application domains, three key trends, and open research questions.
arXiv Detail & Related papers (2023-01-13T14:41:05Z)
- SSL4EO-S12: A Large-Scale Multi-Modal, Multi-Temporal Dataset for Self-Supervised Learning in Earth Observation [20.94411133447731]
Self-supervised pre-training bears potential to generate expressive representations without human annotation.
We share an unlabeled RS dataset SSL4EO-S12 to assemble a global, multimodal, and multi-seasonal corpus of satellite imagery.
arXiv Detail & Related papers (2022-11-13T23:38:27Z)
- SDW-ASL: A Dynamic System to Generate Large Scale Dataset for Continuous American Sign Language [0.0]
We release the first version of our ASL dataset, which contains 30k sentences, 416k words, a vocabulary of 18k words, in a total of 104 hours.
This is the largest continuous sign language dataset published to date in terms of video duration.
arXiv Detail & Related papers (2022-10-13T07:08:00Z)
- Sound and Visual Representation Learning with Multiple Pretraining Tasks [104.11800812671953]
Self-supervised tasks (SSL) reveal different features from the data.
This work aims to combine multiple SSL tasks (Multi-SSL) that generalize well for all downstream tasks.
Experiments on sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models.
arXiv Detail & Related papers (2022-01-04T09:09:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.