GlobalWasteData: A Large-Scale, Integrated Dataset for Robust Waste Classification and Environmental Monitoring
- URL: http://arxiv.org/abs/2602.07463v1
- Date: Sat, 07 Feb 2026 09:36:39 GMT
- Title: GlobalWasteData: A Large-Scale, Integrated Dataset for Robust Waste Classification and Environmental Monitoring
- Authors: Misbah Ijaz, Saif Ur Rehman Khan, Abd Ur Rehman, Tayyaba Asif, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim
- Abstract summary: We introduce the GlobalWasteData (GWD) archive, a large-scale dataset of 89,807 images across 14 main categories. This GWD archive offers consistent labeling, improved domain diversity, and more balanced class representation. Overall, this dataset offers a strong foundation for Machine Learning (ML) applications in environmental monitoring, recycling automation, and waste identification.
- Score: 5.4998857381465465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing amount of waste is an environmental problem that requires efficient sorting techniques for various kinds of waste. Automated waste classification systems are used for this purpose. The effectiveness of the underlying Artificial Intelligence (AI) models depends on the quality and accessibility of publicly available datasets, which provide the basis for training and analyzing classification algorithms. Although several public waste classification datasets exist, they remain fragmented, inconsistent, and biased toward specific environments. Differences in class names, annotation formats, image conditions, and class distributions make it difficult to combine these datasets or train models that generalize well to real-world scenarios. To address these issues, we introduce the GlobalWasteData (GWD) archive, a large-scale dataset of 89,807 images across 14 main categories, annotated with 68 distinct subclasses. We compile this novel integrated GWD archive by merging multiple publicly available datasets into a single, unified resource. This GWD archive offers consistent labeling, improved domain diversity, and more balanced class representation, enabling the development of robust and generalizable waste recognition models. Additional preprocessing steps such as quality filtering, duplicate removal, and metadata generation further improve dataset reliability. Overall, this dataset offers a strong foundation for Machine Learning (ML) applications in environmental monitoring, recycling automation, and waste identification, and is publicly available to promote future research and reproducibility.
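The abstract does not spell out how the merge is performed. As a rough illustration only, the following Python sketch shows one way label harmonization and exact-duplicate removal could be combined. The dataset names, label strings, and the `LABEL_MAP`, `unify`, and `class_counts` names are hypothetical and are not taken from the GWD release. The sketch catches only byte-identical duplicates, not near-duplicates.

```python
import hashlib
from collections import Counter

# Hypothetical mapping from (source dataset, original label) to a unified
# (main category, subclass) pair. All names here are illustrative assumptions.
LABEL_MAP = {
    ("dataset_a", "plastic bottle"): ("plastic", "bottle"),
    ("dataset_b", "PET_bottle"): ("plastic", "bottle"),
    ("dataset_b", "newspaper"): ("paper", "newspaper"),
}


def unify(records):
    """Merge (source, label, image_bytes) records into one labeled list.

    Records whose label has no entry in LABEL_MAP are dropped, and images
    with identical bytes are kept only once (first occurrence wins).
    """
    seen_hashes = set()
    merged = []
    for source, label, data in records:
        mapped = LABEL_MAP.get((source, label))
        if mapped is None:
            continue  # simple filter: skip labels we cannot map
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of an image already kept
        seen_hashes.add(digest)
        category, subclass = mapped
        merged.append(
            {
                "sha256": digest,
                "category": category,
                "subclass": subclass,
                "source": source,
            }
        )
    return merged


def class_counts(merged):
    """Count images per main category to inspect class balance."""
    return Counter(record["category"] for record in merged)
```

Calling `class_counts(unify(records))` on a combined record list gives a quick view of how balanced the main categories are after merging. A real pipeline would likely add perceptual hashing for near-duplicates and image-quality checks, which this sketch leaves out.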
Related papers
- An Ensemble Learning Approach towards Waste Segmentation in Cluttered Environment [2.723394443506285]
This study focuses on waste segregation, a crucial step in recycling processes to obtain raw material. Recent advancements in computer vision have significantly contributed to waste classification and recognition. In waste segregation, segmentation masks are essential for robots to accurately localize and pick objects from conveyor belts.
arXiv Detail & Related papers (2026-02-14T09:07:00Z) - The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation [0.0]
The dataset comprises 13,348 labeled images collected through multiple methods, including DWaste mobile app and curated web sources. The dataset was benchmarked using state-of-the-art deep learning models. Experiment results indicate EfficientNetV2S achieved the highest performance with 96.19% accuracy and a 0.96 F1-score, though with a moderate carbon cost.
arXiv Detail & Related papers (2026-02-11T04:01:12Z) - EBES: Easy Benchmarking for Event Sequences [17.277513178760348]
Event Sequences (EvS) refer to sequential data characterized by irregular sampling intervals and a mix of categorical and numerical features. EBES is a comprehensive benchmark for EvS classification with sequence-level targets. It features standardized evaluation scenarios and protocols, along with an open-source PyTorch library that implements 9 modern models.
arXiv Detail & Related papers (2024-10-04T13:03:43Z) - WasteGAN: Data Augmentation for Robotic Waste Sorting through Generative Adversarial Networks [7.775894876221921]
We introduce a data augmentation method based on a novel GAN architecture called wasteGAN.
The proposed method improves the performance of semantic segmentation models, starting from a very limited set of labeled examples.
We then leverage the higher-quality segmentation masks predicted from models trained on the wasteGAN synthetic data to compute semantic-aware grasp poses.
arXiv Detail & Related papers (2024-09-25T15:04:21Z) - Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach [36.47860223750303]
We consider the problem of automatic curation of high-quality datasets for self-supervised pre-training.
We propose a clustering-based approach for building datasets that satisfy these criteria.
Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository.
arXiv Detail & Related papers (2024-05-24T14:58:51Z) - SpectralWaste Dataset: Multimodal Data for Waste Sorting Automation [46.178512739789426]
We present SpectralWaste, the first dataset collected from an operational plastic waste sorting facility.
This dataset contains labels for several categories of objects that commonly appear in sorting plants.
We propose a pipeline employing different object segmentation architectures and evaluate the alternatives on our dataset.
arXiv Detail & Related papers (2024-03-26T18:39:38Z) - infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - VisDA 2022 Challenge: Domain Adaptation for Industrial Waste Sorting [61.52419223232737]
In industrial waste sorting, one of the biggest challenges is the extreme diversity of the input stream.
We present the VisDA 2022 Challenge on Domain Adaptation for Industrial Waste Sorting.
arXiv Detail & Related papers (2023-03-26T21:38:38Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.