Related papers: Evaluating the Quality of Open Building Datasets for Mapping Urban Inequality: A Comparative Analysis Across 5 Cities

URL: http://arxiv.org/abs/2508.12872v1
Date: Mon, 18 Aug 2025 12:14:57 GMT
Title: Evaluating the Quality of Open Building Datasets for Mapping Urban Inequality: A Comparative Analysis Across 5 Cities
Authors: Franz Okyere, Meng Lu, Ansgar Brunn
Abstract summary: This study evaluates the quality and biases of AI-generated Open Building datasets generated by Google and Microsoft against OpenStreetMap (OSM) data. The results indicate significant variance in data quality, with Houston and Berlin demonstrating high alignment and completeness. There are gaps in the datasets analysed, and cities like Accra and Caracas may be under-represented.
Score: 1.4747234049753448
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While informal settlements lack focused development and are highly dynamic, the quality of spatial data for these places may be uncertain. This study evaluates the quality and biases of AI-generated Open Building Datasets (OBDs) generated by Google and Microsoft against OpenStreetMap (OSM) data, across diverse global cities including Accra, Nairobi, Caracas, Berlin, and Houston. The Intersection over Union (IoU), overlap analysis and a positional accuracy algorithm are used to analyse the similarity and alignment of the datasets. The paper also analyses the size distribution of the building polygon area, and completeness using predefined but regular spatial units. The results indicate significant variance in data quality, with Houston and Berlin demonstrating high alignment and completeness, reflecting their structured urban environments. There are gaps in the datasets analysed, and cities like Accra and Caracas may be under-represented. This could highlight difficulties in capturing complex or informal regions. The study also notes different building size distributions, which may be indicative of the global socio-economic divide. These findings may emphasise the need to consider the quality of global building datasets to avoid misrepresentation, which is an important element of planning and resource distribution.
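The abstract names Intersection over Union (IoU) as one of the similarity measures. As a rough illustration only, the sketch below computes IoU for two axis-aligned rectangles in plain Python. The `Box` type, the function names, and the sample coordinates are assumptions made for this example and are not taken from the paper; real building footprints are arbitrary polygons and would need a geometry library to intersect.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned footprint; a simplification of a building polygon."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def area(b: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, b.max_x - b.min_x) * max(0.0, b.max_y - b.min_y)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union: shared area divided by combined area."""
    inter_w = min(a.max_x, b.max_x) - max(a.min_x, b.min_x)
    inter_h = min(a.max_y, b.max_y) - max(a.min_y, b.min_y)
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


# Illustrative footprints (made-up coordinates): a reference building and
# a detected building shifted by half its width.
reference_building = Box(0.0, 0.0, 10.0, 10.0)
detected_building = Box(5.0, 0.0, 15.0, 10.0)
print(iou(reference_building, detected_building))
```

An IoU of 1.0 means the two footprints coincide exactly, and 0.0 means they do not overlap at all, so the measure penalises both positional shifts and size mismatches.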

Related papers

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z)
Harnessing Rich Multi-Modal Data for Spatial-Temporal Homophily-Embedded Graph Learning Across Domains and Localities [2.5065738436850835]
This research proposes a heterogeneous data pipeline that performs cross-domain data fusion. We aim to address complex urban problems across multiple domains and localities by harnessing the rich information from over 50 data sources.
arXiv Detail & Related papers (2025-12-11T23:51:54Z)
Generalizable Slum Detection from Satellite Imagery with Mixture-of-Experts [20.100765943688454]
GRAM is a two-phase test-time adaptation framework that enables robust slum segmentation without requiring labeled data from target regions. We use a million-scale satellite imagery dataset from 12 cities across four continents for source training. During adaptation, prediction consistency across experts filters out unreliable pseudo-labels, allowing the model to generalize effectively to previously unseen regions.
arXiv Detail & Related papers (2025-11-13T13:35:50Z)
Synthetic Data Matters: Re-training with Geo-typical Synthetic Labels for Building Detection [13.550020274133866]
We propose re-training models at test time using synthetic data tailored to the target region's city layout. This method generates geo-typical synthetic data that closely replicates the urban structure of a target area. Experiments demonstrate significant performance enhancements, with median improvements of up to 12%, depending on the domain gap.
arXiv Detail & Related papers (2025-07-22T14:53:13Z)
Urban Forms Across Continents: A Data-Driven Comparison of Lausanne and Philadelphia [7.693465097015469]
This study presents a data-driven framework to identify and compare urban typologies across geographically and culturally distinct cities. We extracted multidimensional features related to topography, multimodality, green spaces, and points of interest for the cities of Lausanne, Switzerland, and Philadelphia, USA. The results reveal coherent and interpretable urban typologies within each city, with some cluster types emerging across both cities despite their differences in scale, density, and cultural context.
arXiv Detail & Related papers (2025-05-05T18:13:22Z)
Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric. We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z)
Identifying every building's function in large-scale urban areas with multi-modality remote-sensing data [5.18540804614798]
This study proposes a semi-supervised framework to identify every building's function in large-scale urban areas. Optical images, building height, and nighttime-light data are collected to describe the morphological attributes of buildings. Results are evaluated by 20,000 validation points and statistical survey reports from the government.
arXiv Detail & Related papers (2024-05-08T15:32:20Z)
Revisiting Link Prediction: A Data Perspective [59.296773787387224]
Link prediction, a fundamental task on graphs, has proven indispensable in various applications, e.g., friend recommendation, protein analysis, and drug interaction prediction. Evidence in existing literature underscores the absence of a universally best algorithm suitable for all datasets. We recognize three fundamental factors critical to link prediction: local structural proximity, global structural proximity, and feature proximity.
arXiv Detail & Related papers (2023-10-01T21:09:59Z)
City Foundation Models for Learning General Purpose Representations from OpenStreetMap [16.09047066527081]
We present CityFM, a framework to train a foundation model within a selected geographical area of interest, such as a city. CityFM relies solely on open data from OpenStreetMap, and produces multimodal representations of entities of different types, spatial, visual, and textual information. In all the experiments, CityFM achieves performance superior to, or on par with, the baselines.
arXiv Detail & Related papers (2023-10-01T05:55:30Z)
Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, SAR) for the study purpose of the cross-city semantic segmentation task. Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN, to promote the AI model's generalization ability from the multi-city environments. HighDAN is capable of retaining the spatially topological structure of the studied urban scene well in a parallel high-to-low resolution fusion fashion.
arXiv Detail & Related papers (2023-09-26T23:55:39Z)
Robust Self-Tuning Data Association for Geo-Referencing Using Lane Markings [44.4879068879732]
This paper presents a complete pipeline for resolving ambiguities during the data association. Its core is a robust self-tuning data association that adapts the search area depending on the entropy of the measurements. We evaluate our method on real data from urban and rural scenarios around the city of Karlsruhe in Germany.
arXiv Detail & Related papers (2022-07-28T12:29:39Z)
SensatUrban: Learning Semantics from Urban-Scale Photogrammetric Point Clouds [52.624157840253204]
We introduce SensatUrban, an urban-scale UAV photogrammetry point cloud dataset consisting of nearly three billion points collected from three UK cities, covering 7.6 km². Each point in the dataset has been labelled with fine-grained semantic annotations, resulting in a dataset that is three times the size of the previous existing largest photogrammetric point cloud dataset.
arXiv Detail & Related papers (2022-01-12T14:48:11Z)
Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges [52.624157840253204]
We present an urban-scale photogrammetric point cloud dataset with nearly three billion richly annotated points. Our dataset consists of large areas from three UK cities, covering about 7.6 km² of the city landscape. We evaluate the performance of state-of-the-art algorithms on our dataset and provide a comprehensive analysis of the results.
arXiv Detail & Related papers (2020-09-07T14:47:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.