Design and Evaluation of a Scalable Data Pipeline for AI-Driven Air Quality Monitoring in Low-Resource Settings
- URL: http://arxiv.org/abs/2508.14451v1
- Date: Wed, 20 Aug 2025 06:19:27 GMT
- Title: Design and Evaluation of a Scalable Data Pipeline for AI-Driven Air Quality Monitoring in Low-Resource Settings
- Authors: Richard Sserujongi, Daniel Ogenrwot, Nicholas Niwamanya, Noah Nsimbe, Martin Bbaale, Benjamin Ssempala, Noble Mutabazi, Raja Fidel Wabinyai, Deo Okure, Engineer Bainomugisha,
- Abstract summary: This paper presents the design, implementation, and evaluation of the AirQo data pipeline. It is built using open-source technologies such as Apache Airflow, Apache Kafka, and Google BigQuery. We demonstrate the pipeline's ability to ingest, transform, and distribute millions of air quality measurements monthly from over 400 monitoring devices.
- Score: 0.4681310436826459
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing adoption of low-cost environmental sensors and AI-enabled applications has accelerated the demand for scalable and resilient data infrastructures, particularly in data-scarce and resource-constrained regions. This paper presents the design, implementation, and evaluation of the AirQo data pipeline: a modular, cloud-native Extract-Transform-Load (ETL) system engineered to support both real-time and batch processing of heterogeneous air quality data across urban deployments in Africa. It is built using open-source technologies such as Apache Airflow, Apache Kafka, and Google BigQuery. The pipeline integrates diverse data streams from low-cost sensors, third-party weather APIs, and reference-grade monitors to enable automated calibration, forecasting, and accessible analytics. We demonstrate the pipeline's ability to ingest, transform, and distribute millions of air quality measurements monthly from over 400 monitoring devices while achieving low latency, high throughput, and robust data availability, even under constrained power and connectivity conditions. The paper details key architectural features, including workflow orchestration, decoupled ingestion layers, machine learning-driven sensor calibration, and observability frameworks. Performance is evaluated across operational metrics such as resource utilization, ingestion throughput, calibration accuracy, and data availability, offering practical insights into building sustainable environmental data platforms. By open-sourcing the platform and documenting deployment experiences, this work contributes a reusable blueprint for similar initiatives seeking to advance environmental intelligence through data engineering in low-resource settings.
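The abstract mentions machine learning-driven calibration of low-cost sensors against reference-grade monitors but does not detail the method. As a hedged illustration only (the function names and the simple linear model are assumptions, not the authors' approach; the actual AirQo pipeline may use richer models and additional features such as humidity and temperature), a minimal calibration step can be sketched as an ordinary least-squares fit of raw readings to collocated reference values:

```python
# Minimal sketch of low-cost sensor calibration against a reference monitor.
# Fits reference ~= a * raw + b by ordinary least squares, then applies the
# learned correction to new raw readings.

def fit_linear_calibration(raw, reference):
    """Return slope a and intercept b for the least-squares line."""
    n = len(raw)
    mean_x = sum(raw) / n
    mean_y = sum(reference) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(raw, reference))
    var = sum((x - mean_x) ** 2 for x in raw)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def calibrate(raw_values, a, b):
    """Apply the learned linear correction to raw sensor readings."""
    return [a * x + b for x in raw_values]

# Example: hypothetical low-cost PM2.5 readings vs collocated reference values.
raw = [10.0, 20.0, 30.0, 40.0]
ref = [12.0, 22.0, 32.0, 42.0]  # reference reads 2 units higher across the range
a, b = fit_linear_calibration(raw, ref)
corrected = calibrate([50.0], a, b)
```

In practice such a fit would be retrained per device and per season, since low-cost sensor drift varies with environmental conditions.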
Related papers
- DataScribe: An AI-Native, Policy-Aligned Web Platform for Multi-Objective Materials Design and Discovery [1.0713846107735632]
DataScribe is an AI-native, cloud-based materials discovery platform. It unifies experimental and computational data through machine-actionable knowledge graphs. By embedding optimization engines, machine learning, and unified access to public and private scientific data directly within the data infrastructure, DataScribe functions as a general-purpose application-layer backbone for laboratories of any scale.
arXiv Detail & Related papers (2026-01-12T19:59:39Z) - A Synthetic Data Pipeline for Supporting Manufacturing SMEs in Visual Assembly Control [0.0]
We present a novel approach for easily integrable and data-efficient visual assembly control. Our approach leverages simulated scene generation based on computer-aided design (CAD) data and object detection algorithms. The results demonstrate a time-saving pipeline for generating image data in manufacturing environments.
arXiv Detail & Related papers (2025-09-16T13:48:55Z) - GreenCrossingAI: A Camera Trap/Computer Vision Pipeline for Environmental Science Research Groups [0.0]
Camera traps have long been used by wildlife researchers to monitor and study animal behavior, population dynamics, habitat use, and species diversity in a non-invasive and efficient manner. While data collection from the field has increased with new tools and capabilities, methods to develop, process, and manage the data, especially the adoption of ML/AI tools, remain challenging. This paper provides a guide to a low-resource pipeline to process camera trap data on-premise, incorporating ML/AI capabilities tailored for small research groups with limited resources and computational expertise.
arXiv Detail & Related papers (2025-07-12T22:02:55Z) - Provenance Tracking in Large-Scale Machine Learning Systems [0.0]
yProv4ML is a tool designed to collect provenance data in a format compliant with the W3C PROV and PROV-ML standards. It is fully integrated with the yProv framework, allowing higher-level provenance pairing for tasks run through workflow management systems.
arXiv Detail & Related papers (2025-07-01T14:10:02Z) - Enhancing Pavement Sensor Data Acquisition for AI-Driven Transportation Research [1.22995445255292]
This paper presents comprehensive guidelines for managing transportation sensor data. It covers both archived static data and real-time data streams. The proposals were applied to INDOT's real-world case studies involving the I-65 and I-69 Greenfield districts.
arXiv Detail & Related papers (2025-02-20T03:37:46Z) - OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z) - Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z) - Outsourcing Training without Uploading Data via Efficient Collaborative Open-Source Sampling [49.87637449243698]
Traditional outsourcing requires uploading device data to the cloud server.
We propose to leverage widely available open-source data, which is a massive dataset collected from public and heterogeneous sources.
We develop a novel strategy called Efficient Collaborative Open-source Sampling (ECOS) to construct a proximal proxy dataset from open-source data for cloud training.
arXiv Detail & Related papers (2022-10-23T00:12:18Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - ESTemd: A Distributed Processing Framework for Environmental Monitoring based on Apache Kafka Streaming Engine [0.0]
Distributed networks and real-time systems are becoming core components of the new computing era, the Internet of Things. The data they generate offer the ability to measure, infer, and understand environmental indicators, from delicate ecologies to natural resources to urban environments.
We propose a distributed framework Event STream Processing Engine for Environmental Monitoring Domain (ESTemd) for the application of stream processing on heterogeneous environmental data.
arXiv Detail & Related papers (2021-04-02T15:04:15Z)
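Both the AirQo pipeline above and ESTemd apply stream processing to heterogeneous environmental data. As an illustrative sketch only (the record schema and field names are assumptions, not either framework's API), the core normalization step a Kafka-style stream processor would perform can be expressed as mapping vendor-specific payloads onto a common record shape:

```python
# Illustrative normalization of heterogeneous environmental events into a
# common (sensor_id, metric, value) record, as a stream processor might do
# before writing to a unified topic or warehouse table.

def normalize(event):
    """Map a vendor-specific payload to a common record, or None if unknown."""
    if "pm2_5" in event:              # hypothetical low-cost PM sensor payload
        return {"sensor_id": event["id"], "metric": "pm2_5", "value": event["pm2_5"]}
    if "temperature_c" in event:      # hypothetical weather-station payload
        return {"sensor_id": event["id"], "metric": "temp_c", "value": event["temperature_c"]}
    return None                       # unknown schema: drop or route to a dead-letter queue

stream = [
    {"id": "aq-01", "pm2_5": 35.2},
    {"id": "ws-07", "temperature_c": 24.1},
    {"id": "bad-99", "foo": 1},
]
records = [r for r in (normalize(e) for e in stream) if r is not None]
```

In a real deployment this function would run inside a consumer loop or a Kafka Streams topology, with the dead-letter path preserving unparseable events for later inspection.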
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.